2  Multiple Regression

Mainly based on Wooldridge (2019), Chapters 3, 9 and Greene (2017)

2.1 Topics of this chapter

  • Motivation for a multiple linear regression model, MLR model
  • Interpretation of the MLR model
  • Computation of the coefficients of the MLR model
  • Frisch-Waugh-Lovell (FWL) Theorem
  • Algebraic properties of OLS estimates
  • Standard assumptions for the MLR model
  • Unbiasedness of OLS
  • Variance of OLS
  • Efficiency of OLS - Gauss-Markov Theorem
  • Misspecification
    • Overspecification
    • Underspecification - omitted variable bias
    • Errors in the variables and proxy variables
    • Random coefficient model
    • Functional forms
  • OLS in matrix notation

2.2 Definition of the multiple linear regression model

\text{ }

Explains variable y in terms of variables x_1, x_2, \ldots , x_k

y \ = \ \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k + u \tag{2.1}

y, … dependent variable
x_j, … independent variables, explanatory variables, regressors
\beta_0, … intercept (coefficient of x_0, which is a vector of ones)
\beta_j, \ j=1, \ldots, k, … slope parameters
u, … error term, disturbance – represents unobservable factors and influences


2.3 Motivation for multiple regression

  • Incorporate more explanatory factors into the model. This is important for testing economic theories and better predictions

  • Explicitly holding fixed other factors that otherwise would end up in u and probably cause a correlation between u and the remaining explanatory variable. If this is actually the case, a causal interpretation of the model is not possible (z_j \rightarrow x \text{ and } z_j \rightarrow y case)

    • Avoid omitted variable bias

    • Can mimic experimental data with random assignment to some extent

  • Allow for more flexible functional (non-linear) forms


2.3.1 Simulation study

To give a more in-depth motivation for multiple regression models, we simulate a simple market model with demand and supply functions.

  • We want to estimate the market demand function for a certain good, i.e., the quantity demanded as a function of the price;

Q_D=f(P) \tag{2.2}

  • The true demand function is supposed to be

Q_D = a_1 P + b_1 Y + u, \quad \ a_1<0, \ \ b_1>0 \tag{2.3}

  • So, demand Q_D negatively depends on prices P (mainly substitution effects) and positively on income Y. u represents an error term of unobserved factors which likewise affect demand

  • We additionally have the supply function, that positively depends on prices P and negatively on wages W. v represents an error term of unobserved factors affecting supply Q_S

Q_S = a_2 P + b_2 W + v, \quad \ a_2>0, \ \ b_2<0 \tag{2.4}

  • Because we are interested only in the price effect, we decide to apply a simple regression model with only one explanatory variable – prices. Is the approach promising? As we will see, the answer is no!

Digression: Reduced forms and identification

Assume that the equilibrium condition Q_D=Q_S=Q is always met and rewrite our demand and supply functions in the following way, so that the two endogenous variables Q and P are on the left hand side and the two variables Y and W along with the two error terms u and v remain on the right side. The index i simply indicates that the demand and supply relationships hold for each observation i. 1

Q_i - a_1 P_i \ = \ b_1 Y_i + u_i \tag{2.5}

Q_i - a_2 P_i \ = \ b_2 W_i + v_i \tag{2.6}

Thereby, we assume that the variables Y, W are exogenous – determined outside the model – and have nothing to do with the two error terms; i.e., there exist no other equations which explain Y, W in terms of the errors u, v. This basically implies that E([Y_i, W_i] \mid u, v)= [0, 0] so that the conditional mean independence assumption is met.

Equation 2.5 and Equation 2.6 constitute the so called structural model with the structural parameters a_1, a_2, b_1, b_2, which we are interested in and want to estimate.

This market model can be easily written in matrix notation:

\underbrace{\left[\begin{array}{cc} 1 & -a_1 \\ 1 & -a_2 \\ \end{array}\right]}_{\mathbf A} \times \underbrace{\left[\begin{array}{c} Q_i \\ P_i \end{array}\right]}_{\mathbf y_i} \ = \ \underbrace {\left[\begin{array}{cc} b_1 & 0 \\ 0 & b_2 \\ \end{array}\right]}_{\mathbf B} \times \underbrace {\left[\begin{array}{c} Y_i \\ W_i \end{array}\right]}_{\mathbf x_i} \ + \ \underbrace {\left[\begin{array}{c} u_i \\ v_i \\ \end{array}\right]}_{\mathbf e_i} \tag{2.7}

Or shorter:

\mathbf A \, \mathbf y_i \ = \ \mathbf B \, \mathbf x_i \ + \ \mathbf e_i

This matrix equation can be solved by multiplying the equation from the left with the inverse of matrix \mathbf A. (Remember that \mathbf A^{-1}\mathbf A = \mathbf I.)

\mathbf y_i \ = \ \mathbf A^{-1} \mathbf B \, \mathbf x_i \ + \ \mathbf A^{-1} \mathbf e_i \tag{2.8}

Or more compact:

\mathbf y_i \ = \ \mathbf \Pi \, \mathbf x_i \ + \ \mathbf A^{-1} \mathbf e_i \tag{2.9}

Equation 2.9 is the so-called Reduced Form of the model. This shows how the endogenous variables in \mathbf y_i (in our case [Q_i, P_i]') solely depend on the exogenous variables in \mathbf x_i (in our case [Y_i, W_i]') and on the exogenous error term \mathbf e_i ([u_i, v_i]' in our case).

The reduced form represents the so-called data generating process because the model is driven by the values of the exogenous variables and thereby generates the values of the endogenous variables.

  • Thus, the reduced form of the model is used to simulate the data for our little simulation study in the text. Thereby, the exogenous variables and error terms Y_i, W_i, u_i, v_i are “replaced” with random draws, and Equation 2.9 (respectively Equation 2.10) is used to calculate the corresponding values for Q_i, P_i

Important note: The reduced form parameter matrix \mathbf \Pi can be consistently estimated by OLS as all right hand side variables in Equation 2.9 are exogenous. Hence, the conditional mean independence assumption is always met for a (“true”) reduced form by assumption.

Identification

If the model is just identified, the structural parameters of Equation 2.7 can be recovered from the estimated reduced form parameter matrix \mathbf \Pi.

  • In particular, for our market model, the reduced form is:

\left[\begin{array}{c} Q_i \\ P_i \end{array}\right]\ = \ \left[\begin{array}{cc} 1 & -a_1 \\ 1 & -a_2 \\ \end{array}\right]^{-1} \left[\begin{array}{cc} b_1 & 0 \\ 0 & b_2 \\ \end{array}\right] \times \left[\begin{array}{c} Y_i \\ W_i \end{array}\right] \ + \ \left[\begin{array}{cc} 1 & -a_1 \\ 1 & -a_2 \\ \end{array}\right]^{-1}\left[\begin{array}{c} u_i \\ v_i \\ \end{array}\right] \tag{2.10}

  • The inverse of \mathbf A is:

\left[\begin{array}{cc} 1 & -a_1 \\ 1 & -a_2 \\ \end{array}\right]^{-1} = \ \frac {1}{a_1-a_2} \left[\begin{array}{cc} -a_2 & a_1 \\ -1 & 1 \\ \end{array}\right]

  • This leads to the reduced form parameter matrix \mathbf \Pi = \mathbf A^{-1} \mathbf B:

\mathbf \Pi \ = \ \frac {1}{a_1-a_2} \left[\begin{array}{cc} -a_2 & a_1 \\ -1 & 1 \\ \end{array}\right] \times \left[\begin{array}{cc} b_1 & 0 \\ 0 & b_2 \\ \end{array}\right] \ = \ \frac {1}{a_1-a_2} \left[\begin{array}{cc} -a_2 b_1 & a_1b_2 \\ -b_1 & b_2 \\ \end{array}\right] \tag{2.11}

  • To recover the structural parameters a_1, a_2, b_1, b_2, we utilize the fact that the formula above implies a relationship for each of the four elements of the estimated \hat {\mathbf \Pi}:

\hat \pi_{11} = -\dfrac {1}{a_1-a_2} a_2b_1 \quad \quad \hat \pi_{12} = \dfrac {1}{a_1-a_2} a_1b_2

\hat \pi_{21} = -\dfrac {1}{a_1-a_2} b_1 \quad \quad \hat \pi_{22} = \dfrac {1}{a_1-a_2} b_2

  • These are four equations in the four structural parameters a_1, a_2, b_1, b_2. The solution for this non-linear system of equations is:

a_1 = \frac{\hat \pi_{12}}{\hat \pi_{22}} \quad \quad a_2 = \frac{\hat \pi_{11}}{\hat \pi_{21}}

b_1 = \hat \pi_{11} - \frac{\hat \pi_{12}\hat \pi_{21}}{\hat \pi_{22}} \quad \quad b_2 = \hat \pi_{12} - \frac{\hat \pi_{11}\hat \pi_{22}}{\hat \pi_{21}}

Note that we had imposed two zero-restrictions on the matrix \mathbf B. In particular, these restrictions mean that variable W_i is not part of the demand function and Y_i is not part of the supply function.

If these zero-restrictions are not imposed, we would have to estimate two additional structural parameters, which raises the number of structural parameters to six. But six parameters cannot be recovered from four equations in a unique manner. Hence, the structural form would not be uniquely determined by the reduced form.

This means that there exist other observationally equivalent structural forms which have the same reduced form – the same data generating process, which generates the very same observations of the endogenous variables Q and P. Therefore, the structural model is not identified (estimable) in this case.

To demonstrate this very important issue, suppose we multiply the original structural model Equation 2.7 with an invertible, but otherwise arbitrary transformation matrix T, \ \text { with } \ t_1 \cdot t_2 \neq 1:

\underbrace{\left[\begin{array}{cc} 1 & t_1 \\ t_2 & 1 \\ \end{array}\right]}_{\mathbf T} \ \underbrace{\left[\begin{array}{cc} 1 & -a_1 \\ 1 & -a_2 \\ \end{array}\right]}_{\mathbf A} \ \underbrace{\left[\begin{array}{c} Q_i \\ P_i \end{array}\right]}_{\mathbf y_i} \ = \ \underbrace{\left[\begin{array}{cc} 1 & t_1 \\ t_2 & 1 \\ \end{array}\right]}_{\mathbf T} \ \underbrace {\left[\begin{array}{cc} b_1 & 0 \\ 0 & b_2 \\ \end{array}\right]}_{\mathbf B} \ \underbrace {\left[\begin{array}{c} Y_i \\ W_i \end{array}\right]}_{\mathbf x_i} \ + \ \underbrace{\left[\begin{array}{cc} 1 & t_1 \\ t_2 & 1 \\ \end{array}\right]}_{\mathbf T} \ \underbrace {\left[\begin{array}{c} u_i \\ v_i \\ \end{array}\right]}_{\mathbf e_i}

Or more compact:

\mathbf A^* \, \mathbf y_i \ = \ \mathbf B^* \, \mathbf x_i \ + \ \mathbf e^*_i \tag{2.12}

with:

\mathbf A^* \ := \ \mathbf T \mathbf A \ = \ \left[\begin{array}{cc} (1+t_1) & (-a_1-t_1a_2) \\ (1+t_2) & (-t_2a_1-a_2) \\ \end{array}\right]

\mathbf B^* \ := \ \mathbf T \, \mathbf B \ = \ \left[\begin{array}{cc} b_1 & t_1b_2 \\ t_2b_1 & b_2 \\ \end{array}\right] \tag{2.13}

\mathbf e^*_i \ := \ \mathbf T \, \mathbf e_i \ = \ \left[\begin{array}{c} u_i+t_1v_i \\ v_i+t_2u_i \\ \end{array}\right]

Equation 2.12 is a new and different structural model, with different demand and supply functions.2

The important point is that the two different model structures have the same reduced form and thus exhibit the same data generating process, yielding the very same observations – they are observationally equivalent.

To prove this claim we calculate the reduced form of the transformed model Equation 2.12:

\mathbf y_i \ = \ \mathbf A^{*-1} \mathbf B^* \mathbf x_i \ + \ \mathbf A^{*-1} \mathbf e^*_i \ =

(\mathbf T \mathbf A)^{-1} \mathbf T \, \mathbf B \, \mathbf x_i \ + \ (\mathbf T \mathbf A)^{-1} \mathbf T \, \mathbf e_i \ = \tag{2.14}

\mathbf A^{-1} \mathbf T^{-1} \mathbf T \, \mathbf B \, \mathbf x_i \ + \ \mathbf A^{-1} \mathbf T^{-1} \mathbf T \, \mathbf e_i \ =

\mathbf A^{-1} \mathbf B \, \mathbf x_i \ + \ \mathbf A^{-1} \mathbf e_i

This is identical to Equation 2.8, which was the reduced form of the original structural model. This proves the observational equivalence of the original and the alternative (transformed) structure. (To derive this result we used rule Equation B.4 and the fact that \mathbf T^{-1} \mathbf T = \mathbf I.)

Note that the two zero-restrictions in the \mathbf B matrix disappeared in the \mathbf B^* matrix of the new structure, see Equation 2.13. This means that the applied transformation is not allowed if we strongly believe that the two zero-restrictions are true and a feature of the true structure. Hence, these are identifying restrictions, ruling out such a transformation of the model.

Corollary: Our analysis showed that we always need some identifying restrictions to be able to estimate a structural equation, as otherwise it is always possible to find alternative models which are observationally equivalent.

Remark: It is clearly possible that some equations of the structural model are identified and some not. For instance, if we impose the zero-restriction for the demand function and not for the supply function, only the demand function would be identified (estimable).


Simulating the market model

  • First, we compute simulated data according to the specified market model above with random numbers for the exogenous variables and error terms Y, W, u, v

    • We define a_1=-1, b_1=1, a_2=1, b_2=-1, sd(Y)=3 and sd(W)=2, sd(u)=0.1 and sd(v)=1

    • Remark: Thereby, we employed an (incredible) identifying restriction to be able to estimate the demand function with OLS. The used restriction is that the variance of the error term u is very small. This assumption prevents a noticeable feedback mechanism which would lead to a violation of the conditional mean independence assumption in the demand equation 3

Code
# Simulating data for the market model

# Demand : Q = a1*P + b1*Y + u
# Supply : Q = a2*P + b2*W + v

# y = (Q, P)', x = (Y, W)', e = (u, v)'

# In matrix language: A*y = B*x + e   =>   
# The solution of the model is:
# y = A^-1*B*x + A^-1*e
# which is the so called reduced form, or data generating process

# Structural parameters
a1=-1
a2=1
b1=1
b2=-1

my = 20 # mean of Y
mw = 5  # mean of W

# Model matrices
A = rbind( c(1, -a1), c(1, -a2) )
B = rbind( c(b1, 0), c(0, b2) )

# Many firms over many periods (pooled cross section)
n=1500
u = c( rnorm(n, 0, 0.1) )  # very small variance of u 
v = c( rnorm(n, 0, 1) )   
Y = c( rnorm(n, my, 3) ) 
W = c( rnorm(n, mw, 2) ) 

x = rbind(Y, W)
e = rbind(u, v)

# Model solution 
y = solve(A) %*% B %*% x + solve(A) %*% diag(2) %*% e

# Simulated Q and P
Q = y[1, ]
P = y[2, ]

# Factor variable for Y
Y_ = cut( Y, c(0, 13,14,15,16,17,18,19,20,21,22,23,24,25,26,27, 50) )
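
Tying back to the identification digression: with the simulated data, the structural parameters can also be recovered by indirect least squares, i.e., by estimating the reduced form Equation 2.10 with OLS and applying the recovery formulas derived above. A minimal sketch (object names are illustrative):

# Indirect least squares: OLS estimation of the reduced form (no intercept,
# as in the structural model) and recovery of a1, a2, b1, b2
rf_Q <- lm(Q ~ 0 + Y + W)   # first reduced form equation: Q on Y and W
rf_P <- lm(P ~ 0 + Y + W)   # second reduced form equation: P on Y and W

Pi_hat <- rbind(coef(rf_Q), coef(rf_P))   # estimated reduced form matrix

a1_hat <- Pi_hat[1, 2] / Pi_hat[2, 2]
a2_hat <- Pi_hat[1, 1] / Pi_hat[2, 1]
b1_hat <- Pi_hat[1, 1] - Pi_hat[1, 2] * Pi_hat[2, 1] / Pi_hat[2, 2]
b2_hat <- Pi_hat[1, 2] - Pi_hat[1, 1] * Pi_hat[2, 2] / Pi_hat[2, 1]

round(c(a1 = a1_hat, a2 = a2_hat, b1 = b1_hat, b2 = b2_hat), 3)  # close to (-1, 1, 1, -1)
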

Estimating the true model of demand

  • Afterwards, we employ a multiple regression model to estimate the demand equation (Equation 2.3), which yields nearly perfect parameter estimates for the true values a_1=-1 and b_1=1 (because we have 1500 simulated observations).
library(AER) # to use `coeftest()`
# True model of demand, with true price effect equal to -1 
# and true income effect equal to 1.  
# The multiple regression model yields nearly perfect parameter estimates.
coeftest( lm(Q ~ P + Y) ) 
       
       t test of coefficients:
       
                     Estimate Std. Error  t value Pr(>|t|)    
       (Intercept) -0.0162019  0.0184326   -0.879   0.3796    
       P           -0.9936583  0.0023127 -429.654   <2e-16 ***
       Y            0.9967699  0.0014447  689.926   <2e-16 ***
       ---
       Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Estimating the simple (misspecified) model of demand

  • Now, we employ the simple regression model, Q = c_0 + c_1 P + u
# Simplified model of demand, ignoring income Y 
# We are finding a wrong positive (!) price effect 
DS <- lm(Q ~ P)
coeftest( DS ) 
       
       t test of coefficients:
       
                   Estimate Std. Error t value  Pr(>|t|)    
       (Intercept) 3.924686   0.312891  12.543 < 2.2e-16 ***
       P           0.282967   0.024769  11.424 < 2.2e-16 ***
       ---
       Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • This yields a strongly biased price effect! Why?

    • The simple regression model ignores income Y, which ends up in the error term u

      • But income is positively correlated with prices P: Higher income \rightarrow more demand and therefore higher prices, leading to a violation of the conditional mean independence assumption in the demand function: the error term now contains Y, so E(u \mid P) \neq 0
    • Hence, we are finding a wrong positive (!) price effect. The true partial effect is -1

  • Actually, we estimate a sort of mixture of the demand and supply function

  • What is the reason for this result? The following Figure 2.1 should shed some light on the problem


Plotting the simulated data

  • Panel a) of Figure 2.1 shows a standard market model with a demand function (in red) and a supply function (in blue). The observed data of Q and P are the result of the intersections of the demand and supply functions – market equilibrium points. These functions vary because of changing values of income Y and wages W and the error terms u and v

    • Note that the realized data points mimic neither the supply nor the demand function
  • Panel b) of Figure 2.1 shows the regression line of a simple regression model of Q on P, which we just estimated above. The slope value of 0.2829674 in no way represents the parameter a_1 of the true demand function Equation 2.3


Code
library(ggplot2)

# FWL-theorem

# demand equation
out_qy <- lm(Q ~ Y)
res_qy <- residuals(out_qy)

out_py <- lm(P ~ Y)
res_py <- residuals(out_py)

outQP_y <- lm(res_qy ~ res_py)


# supply equation
out_qw <- lm(Q ~ W)
res_qw <- residuals(out_qw)

out_pw <- lm(P ~ W)
res_pw <- residuals(out_pw)

outQP_r <- lm(res_qw ~ res_pw)


# Plotting

ggplot( mapping=aes(x=P, y=Q))  +
  geom_point() +
  geom_abline(intercept=b1*my-(-4:4), slope=a1, color="#cd0000", linewidth=0.4, linetype = "dashed") +
  geom_abline(intercept=b1*my, slope=a1, color="#cd0000", linewidth=1) +
  geom_abline(intercept=b2*mw-(-4:4), slope=a2, color="#0066cc", linewidth=0.4, linetype = "dashed") +
  geom_abline(intercept=b2*mw, slope=a2, color="#0066cc", linewidth=1) +
  theme_bw()

ggplot( mapping=aes (x=P, y=Q))  +
  geom_point( ) +
  geom_smooth(method=lm, se=F, linewidth=1, color="#aa55aa") +
  theme_bw()

ggplot( mapping=aes (x=P, y=Q) )  +
  geom_point( aes(color=Y_ ) ) +
  geom_smooth( aes(color=Y_ ), method=lm, se=F, linewidth=1) +
  geom_abline(intercept=b1*my, slope=a1, color="#cd0000", linewidth=1) +
  annotate("text", x = 12.3, y = 13, label = "Demand functions Q(P, Y) for given Y", size=6, color="#4488AA") +
  theme_bw()

ggplot( mapping=aes (x=res_py, y=res_qy) )  +
  geom_point( aes(color=Y_ ) ) +
  geom_smooth( method=lm, se=F, linewidth=1,   color="#dd0000") +
  theme_bw()

(a) Market equilibrium points – intersection points of shifting supply (blue) and demand (red) function (because of varying Y, W, u, v). These are the points (Q, P) we observe

(b) Regression line lm(Q ~ P) through the data points. But this is not the demand function we look for – we get a spurious positive effect of P on Q

(c) Regression lines lm(Q ~ P | Y) for several given income levels Y, which are colored differently. This shows the true partial effect of P on demand Q for a given income. The true demand function for the mean income is in red

(d) True demand relationship: Multiple regression model, holding income constant with the help of the FWL-theorem, see Section 2.5

Figure 2.1: Model of supply and demand. Every observation results from an intersection of the supply and demand function (market equilibrium). Supply and demand functions vary because of variations of income, wages and unobservable shocks in supply and demand (error terms)


  • In panel c) we are plotting the same data as in b) but highlighting different levels of income with different colors. We further draw separate regression lines for the different income levels

    • This reveals the true negative (!) demand-relationship between quantity demanded and prices, given certain levels of income
  • Panel d) shows the demand relationship for constant income, which mimics the true price effect on demand. We accomplished this by purging the quantity and price data of the effects of income – controlling for income. This is an application of the very important FWL-Theorem discussed later in Section 2.5

  • To demonstrate the relationship between quantities and prices shown in panel c), we estimate a simple regression model, but only for a very narrow range of income. This procedure leads to a correct estimation of a_1 in a simple regression framework, as shown in panel c) for several different given income ranges

# We regress Q on P, but only for the subset that Y is from the interval [19.8, 20.2]
coeftest( lm(Q ~ P, subset = (Y>19.8 & Y<20.2) ) )
       
       t test of coefficients:
       
                    Estimate Std. Error t value  Pr(>|t|)    
       (Intercept) 19.816581   0.181806 108.999 < 2.2e-16 ***
       P           -0.985611   0.014562 -67.682 < 2.2e-16 ***
       ---
       Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This fact demonstrates that the estimated coefficients of a multiple regression model represent the effect of one explanatory variable on the dependent variable, holding the other explanatory variables constant.


2.3.2 Empirical example: Wage equation

Measuring the effect of education on wages, but this time explicitly holding experience fixed

The model: \log(wage) \ = \ \beta_0 + \beta_1 educ + \beta_2 exper + u
\text{ }

library(AER)  # AER is a general econometric package 
library(wooldridge)  # the wooldridge package includes all data sets used in the textbook  
data(wage1)  # loading data set wage1 of the wooldridge  package  
# regressing log(wage) on educ, storing results in list model1

model1 <- lm(lwage ~ educ, data = wage1)
# regressing log(wage) on educ and exper, storing results in list model2

model2 <- lm(lwage ~ educ + exper, data = wage1)

Showing estimation results
# show estimated parameters for:
model1
      
      Call:
      lm(formula = lwage ~ educ, data = wage1)
      
      Coefficients:
      (Intercept)         educ  
          0.58377      0.08274
model2
      
      Call:
      lm(formula = lwage ~ educ + exper, data = wage1)
      
      Coefficients:
      (Intercept)         educ        exper  
          0.21685      0.09794      0.01035

Interpretation of the multiple regression model

  • The multiple linear regression model manages to hold the values of other explanatory variables fixed even if, in reality, they are correlated with the explanatory variable under consideration

  • Hence, we have a “ceteris paribus”-interpretation

  • It still has to be assumed that the unobserved factors in u do not change as a result of changing explanatory variables (conditional mean independence assumption)

  • In our particular wage equation example, the effect of one more year of education is 0.083 on log(wage), or (times 100) 8.3% on wage, if other effects are not controlled for. But in this case, some of the effect could stem from more experience, which is possibly negatively correlated with education

  • If we additionally include a variable for experience, exper, the effect of education on wage increases to 9.8%. That means:
    If we compare two persons with the same level of experience, but person A has one year more education than person B, we predict person A to have a wage that is 9.8% higher than that of person B

  • Hence, in some sense, a multiple regression model simulates a controlled experiment by holding constant other relevant factors

  • The error term contains all the other variables which we do not control for (maybe because we have no data; e.g., ability in our wage equation which probably has an important impact on wages)


Interpretation of coefficients of the multiple regression model

  • If the dependent variable is in logs, we have a semi-elasticity

\beta_j = \dfrac {\partial \log(y)}{\partial x_j} = \dfrac {\partial y/y}{\partial x_j}

  • This measures by how many percent y changes (for instance 0.05) if x_j changes by one unit (we have to multiply \beta_j by 100 to get the percentage value, 5%); a small numerical check follows after this list

  • If both the dependent and the explanatory variable are in logs, we arrive at an elasticity, which is measured in percent

\beta_j = \dfrac {\partial \log(y)}{\partial \log(x_j)} = \dfrac {\partial y/y}{\partial x_j/x_j}

  • If only the explanatory variable is in logs, we have

\beta_j = \dfrac {\partial y}{\partial \log(x_j)} = \dfrac {\partial y}{\partial x_j/x_j}

  • This measures by how many units the dependent variable changes if \log(x_j) increases by one unit. So, you have to divide \beta_j by 100 to get the effect of a 1% increase in x_j on y
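
Returning to the first case (dependent variable in logs): since 100 \cdot \beta_j is only an approximation for small coefficients, the exact percentage effect is 100\,(e^{\beta_j}-1). A small check, using the educ coefficient of model2 from above:

# Approximate vs. exact percentage effect of one more year of education on wage
b_educ <- unname(coef(model2)["educ"])
c(approx = 100 * b_educ, exact = 100 * (exp(b_educ) - 1))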

Non-linearities

# plot of effects of exper on lwage
plot(wage1$lwage ~ wage1$exper)


  • There seems to be a nonlinear effect of experience on wages

  • So, we try the model specification:

\log(wage) \ = \ \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + u

# modeling a non-linear effect of exper
model3 <- lm(lwage ~ educ + exper + I(exper^2), data = wage1)
model3
      
      Call:
      lm(formula = lwage ~ educ + exper + I(exper^2), data = wage1)
      
      Coefficients:
      (Intercept)         educ        exper   I(exper^2)  
        0.1279975    0.0903658    0.0410089   -0.0007136

In this particular case, the partial effect of exper is

\dfrac{\partial lwage}{\partial exper} = \beta_2 + 2\beta_3exper = 0.041 - 0.00143 exper

  • Hence, this partial effect is not constant but is a function of exper. Often, these partial effects are calculated

    • at the average of the explanatory variables - Partial Effect at Average, PEA

    • or as the effect averaged over all observations, APE

  • Partial effects are sometimes inconvenient to calculate, especially with products of variables – interaction terms. But there are packages in R, for instance marginaleffects, which calculate the average partial effects and present them in tables or graphically (a manual cross-check follows after the output below)

# Calculating the APE for the model above with "marginaleffects"
library(marginaleffects)
avg_slopes(model3)
      
        Term Estimate Std. Error     z Pr(>|z|)     S  2.5 % 97.5 %
       educ    0.0904    0.00747 12.10   <0.001 109.6 0.0757 0.1050
       exper   0.0167    0.00182  9.17   <0.001  64.1 0.0131 0.0203
      
      Type:  response 
      Comparison: dY/dX
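
The APE for exper reported above can be reproduced by hand from the coefficients of model3, averaging the derivative \beta_2 + 2\beta_3 \, exper over all observations; a small cross-check:

# Average partial effect of exper computed manually from model3
b <- coef(model3)
mean(b["exper"] + 2 * b["I(exper^2)"] * wage1$exper)   # ~ 0.0167, as reported by avg_slopes()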

# with "plot_slopes" from the package "marginaleffects" we can plot the partial effect from above
# depending on "exper"
plot_slopes(model3, variable = "exper", condition = "exper") + theme_bw()


# with "plot_predictions" from the package "marginaleffects" we can plot the 
# predicted values of lwage depending on "exper"
plot_predictions(model3, condition="exper") + theme_bw()


2.4 Computing the coefficients of the multiple regression model

  • As in the simple regression model we assume a random sample of y_i and x_{i,j}

  • We define the regression residuals: \hat u_i := \ y_i - \hat \beta_0 - \hat \beta_1 x_{i,1} - \hat \beta_2 x_{i,2} - \ldots - \hat \beta_k x_{i,k} \tag{2.15}

  • We choose \hat \beta_0, \hat \beta_1, \hat \beta_2, \ldots \hat \beta_k so that the sum of squared residuals is minimized

\underset{\hat{\beta}}{\min } \sum_{i=1}^{n} \hat u_i^2 \ \rightarrow \ \hat \beta_0, \hat \beta_1, \hat \beta_2, \ldots \hat \beta_k

  • This leads to the following formula for \hat \beta_1

\hat{\beta}_{1} \, = \, \frac{ \sum_{i=1}^{n} \hat{r}_{i,1} \ y_{i}}{\sum_{i=1}^{n} \hat{r}_{i,1}^{2}} \tag{2.16}

  • If we compare this formula with the one for the single regression model, we notice that (x_i - \bar x) is now replaced with \hat r_{i,1}

  • Thereby, \hat r_{1} are the residuals of a regression of x_{1} on all the other explanatory variables of the model, x_0, x_{2}, \ldots , x_{k}

    • This means, only the part of x_{1} that is uncorrelated with all other x_{2}, \ldots , x_{k} is relevant for estimation of \beta_1
  • The formula in Equation 2.16 shows that the multiple regression model from Equation 2.1 is equivalent to a transformed, single regression model

y \ = \ \beta_0 + \beta_1 \hat r_{1} + v \tag{2.17}

  • This way, we get the partial effect of x_1 on y, holding constant (or controlling for the effects of) all other x-es

  • This fact is the result of an important general theorem, the Frisch-Waugh-Lovell (FWL) Theorem


2.4.1 Matrix notation of the multiple regression model

The model

y_1 \ = \ \beta_0 + \beta_1 x_{1,1} + \beta_2 x_{1,2} + \ldots + \beta_k x_{1,k} + u_1 \\ \vdots \\ y_n \ = \ \beta_0 + \beta_1 x_{n,1} + \beta_2 x_{n,2} + \ldots + \beta_k x_{n,k} + u_n

is stacked to

\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{u} \tag{2.18}

with

\mathbf{y} = \underbrace{ \left[\begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{array}\right]}_{n \ \times \ 1}, \ \ \mathbf{X} = \underbrace{ \left[\begin{array}{ccccc} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1, k} \\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2, k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n, 1} & x_{n, 2} & \cdots & x_{n, k} \end{array}\right]}_{n \ \times \ (k+1) }, \ \ \boldsymbol{\beta} = \underbrace{ \left[\begin{array}{c} \beta_{0} \\ \vdots \\ \beta_{k} \end{array}\right]}_{(k+1) \ \times \ 1}, \ \ \mathbf{u} = \underbrace{ \left[\begin{array}{c} u_{1} \\ u_{2} \\ \vdots \\ u_{n} \end{array}\right]}_{n \ \times \ 1}


Principle of ordinary least squares:

For calculating the residuals insert an estimate of \boldsymbol \beta in Equation 2.18

\mathbf{y} = \mathbf{X} \boldsymbol{\hat \beta} + \mathbf{\hat u}

The residuals are then defined as:

\mathbf{\hat u} \ = \ \mathbf y - \mathbf{X} \boldsymbol{\hat \beta} \ = \ (\mathbf y - \mathbf{\hat y})

The sum of the squared residuals is:

SSR := \ \sum_{i=1}^{n}\left(\hat{u}_{i}\right)^{2} \ =\ \hat{\mathbf{u}}^{\prime} \hat{\mathbf{u}} \ = \ (\mathbf{y}-\mathbf{X} \hat{\boldsymbol{\beta}})^{\prime}(\mathbf{y}-\mathbf{X} \hat{\boldsymbol{\beta}}) \ =

\mathbf{y}^{\prime} \mathbf{y}-2 \hat{\boldsymbol{\beta}}^{\prime} \mathbf{X}^{\prime} \mathbf{y}+\hat{\boldsymbol{\beta}}^{\prime} \mathbf{X}^{\prime} \mathbf{X} \hat{\boldsymbol{\beta}} \\

as \hat{\boldsymbol{\beta}}' \mathbf{X}' \mathbf{y} is a scalar and hence equal to (\hat{\boldsymbol{\beta}}' \mathbf{X}' \mathbf{y})' = \mathbf{y}' \mathbf{X} \hat{\boldsymbol{\beta}}.

For minimization of the sum of squared residuals set the first derivative with respect to \boldsymbol{\hat \beta} to zero:

\underset{\hat{\beta}}{\min }(S S R): \ \frac{\partial S S R}{\partial \hat{\boldsymbol{\beta}}} \ = \ \frac{\partial\left(\mathbf{y}^{\prime} \mathbf{y}-2 \hat{\boldsymbol{\beta}}' \mathbf{X}^{\prime} \mathbf{y}+\hat{\boldsymbol{\beta}}' \mathbf{X}^{\prime} \mathbf{X} \hat{\boldsymbol{\beta}}\right)}{\partial \hat{\boldsymbol{\beta}}} \ = 0 \ \ \Rightarrow

-2 \mathbf{X}' {\mathbf{y}}+2 \mathbf{X}^{\prime} \mathbf{X} \hat{\boldsymbol{\beta}}=0


Solution for the parameter vector

This gives the so called normal equations:

\mathbf{X}^{\prime} \mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{X}' \mathbf{y} \\

The solution for the parameter vector \hat{\boldsymbol{\beta}} therefore is:

\hat{\boldsymbol{\beta}}=(\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}' \mathbf{y} \tag{2.19}

Thereby,

\mathbf{X}^{\prime} \mathbf{X}=\left[\begin{array}{cccc} 1 & 1 & \cdots & 1 \\ x_{1,1} & x_{2,1} & \cdots & x_{n, 1} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1, k} & x_{2, k} & \cdots & x_{n, k} \end{array}\right] \cdot\left[\begin{array}{cccc} 1 & x_{1,1} & \cdots & x_{1, k} \\ 1 & x_{2,1} & \cdots & x_{2, k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n, 1} & \cdots & x_{n, k} \end{array}\right]=

=\left[\begin{array}{cccc} n & \sum_{i} x_{i, 1} & \cdots & \sum_{i} x_{i, k} \\ \sum_{i} x_{i, 1} & \sum_{i} x_{i, 1}^{2} & \cdots & \sum_{i} x_{i, 1} x_{i, k} \\ \sum_{i} x_{i, 2} & \sum_{i} x_{i, 2} x_{i, 1} & \ddots & \sum_{i} x_{i, 2} x_{i, k} \\ \vdots & \vdots & & \vdots \\ \sum_{i} x_{i, k} & \sum_{i} x_{i, k} x_{i, 1} & \cdots & \sum_{i} x_{i, k}^{2} \end{array}\right] \ = \ \sum_{i=1}^n \mathbf x_i' \mathbf x_i

and

\mathbf{X}^{\prime} \mathbf{u}=\left[\begin{array}{cccc} 1 & 1 & \cdots & 1 \\ x_{1,1} & x_{2,1} & \cdots & x_{n, 1} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1, k} & x_{2, k} & \cdots & x_{n, k} \end{array}\right] \cdot\left[\begin{array}{c} u_{1} \\ u_{2} \\ \vdots \\ u_{n} \end{array}\right]=\left[\begin{array}{c} \sum_{i} u_{i} \\ \sum_{i} x_{i, 1} u_{i} \\ \vdots \\ \sum_{i} x_{i, k} u_{i} \end{array}\right] \ = \ \sum_{i=1}^n \mathbf x_i' u_i \\

with \mathbf x_i being the i^{th} row of the data matrix \mathbf X \, (so, this is a 1 \times (k+1) row vector!). \, \mathbf x_i' \mathbf x_i therefore is an outer product and hence a (k+1) \times (k+1) matrix and \,\mathbf x_i' u_i is a (k+1) \times 1 vector.


Estimator for the variance of the error term: \hat \sigma^2:

\hat \sigma^2 \ = \ \dfrac{1}{n-k-1} \sum_{i=1}^n \hat u_i^2 \ = \ \dfrac{\mathbf {\hat u}' \mathbf {\hat u}} {n-k-1} \tag{2.20}
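
Equation 2.19 and Equation 2.20 are easy to verify numerically. A minimal sketch with the wage1 data (the chosen regressors are just for illustration); the results should match lm(lwage ~ educ + exper, data = wage1):

# OLS "by hand": beta_hat = (X'X)^(-1) X'y  and  sigma2_hat = u'u / (n-k-1)
library(wooldridge)
data(wage1)

y <- wage1$lwage
X <- cbind(1, wage1$educ, wage1$exper)   # intercept, educ, exper
n <- nrow(X); k <- ncol(X) - 1

beta_hat   <- solve(t(X) %*% X) %*% t(X) %*% y               # Equation 2.19
u_hat      <- y - X %*% beta_hat                             # residuals
sigma2_hat <- as.numeric(t(u_hat) %*% u_hat) / (n - k - 1)   # Equation 2.20

beta_hat            # compare with coef(lm(lwage ~ educ + exper, data = wage1))
sqrt(sigma2_hat)    # residual standard error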


We have the population model for every observation, with \mathbf x_i being a row vector with the (k+1) explanatory variables:

y_i = \mathbf{x}_i \boldsymbol{\beta} + u_i

Additionally, we have assumption 4 (MLR.4): E (u_i \mid \mathbf X) = 0, which implies:

E(\mathbf x_i' u_i) \ = \ E_x [E(\mathbf x_i' u_i \mid \mathbf X)] \ = \ E_x [\mathbf x_i' \, E( u_i \mid \mathbf X)] \ = \ E_x [\mathbf x_i' \cdot 0] \ = \ \mathbf 0

This, furthermore implies:

E(\mathbf x_i' u_i) \ = \ E \left(\mathbf x_i' ( y_i - \mathbf x_i \boldsymbol \beta) \right) \ = \

E(\mathbf x_i' y_i) - E(\mathbf x_i' \mathbf x_i) \, \boldsymbol \beta \ = \ \mathbf 0

\Rightarrow \ \ \boldsymbol \beta \ = \ E(\mathbf x_i' \mathbf x_i)^{-1} E(\mathbf x_i' y_i)

If we have a sample of n observations we can implement these (k+1) moment restrictions by replacing the population (theoretical) moments with their sample counterparts to estimate the (k+1) parameters

\boldsymbol{\hat \beta} \ = \ \left( \frac{1}{n} \sum_{i=1}^n \mathbf x_i' \mathbf x_i \right )^{-1} \left( \frac{1}{n} \sum_{i=1}^n \mathbf x'_i y_i \right ) \ = \ \left( \sum_{i=1}^n \mathbf x_i' \mathbf x_i \right )^{-1} \left( \sum_{i=1}^n \mathbf x'_i y_i \right ) \ = \ (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}' \mathbf{y}

In this case, the Method of Moments (MoM) estimator is the same as the OLS estimator. This is due to our standard assumption MLR.4.


2.5 The Frisch-Waugh-Lovell (FWL) Theorem

  • Equation 2.17 above shows that alternatively, one can estimate the coefficient of any explanatory variable of a multiple regression within a simple regression model by a two-step procedure:

    • 1 Regress the explanatory variable – in our case x_1 – on all other explanatory variables including the intercept

    • 2 Regress y on the residuals from this regression (as in Equation 2.17)

    • Important remark: If we additionally regress y on the other explanatory variables from above and use the residuals of this regression to replace y in Equation 2.17, we get the same estimate of \hat \beta_1.
      The advantage of this variant is that in this case the residuals of Equation 2.17 are identical to those of Equation 2.1

  • This is the famous and very important FWL Theorem

  • Why does this procedure work?

    • The residuals from the first regression are the part of the explanatory variable x_1 that is uncorrelated with the other explanatory variables

    • The slope coefficient of the second regression (representing Equation 2.17) therefore yields the isolated effect of the explanatory variable on the dependent variable, because x_1 is purged of any indirect effects of the other variables


FWL Theorem - an example

We regress log(wage) on educ, exper and expersq. First, we estimate the usual multiple regression model

# loading a library which allows printing the results of several regressions in one table
library(modelsummary)

# We use the wage1 data from the Wooldridge package
data(wage1)
# original model, we are interested in the coef of educ

out <- lm(lwage ~ educ + exper + expersq, data = wage1)
out
      
      Call:
      lm(formula = lwage ~ educ + exper + expersq, data = wage1)
      
      Coefficients:
      (Intercept)         educ        exper      expersq  
        0.1279975    0.0903658    0.0410089   -0.0007136

Applying the FWL theorem

Now, we want to estimate the coefficient of educ via a single regression, applying the FWL theorem. So, educ is x_1 and the other explanatory variables are the intercept, exper, expersq.

  • Hence, in the first step we regress educ on [intercept, exper, expersq]
# regressing educ on exper + expersq

out3 <- lm(educ ~ exper + expersq, data = wage1)
educ_res = out3$residuals
  • Additionally, we regress the dependent variable lwage on [intercept, exper, expersq] and store the residuals in lwage_res. This second step is not necessary, but useful if we want to have the residuals of the original model and carry out t-tests
# regressing lwage on exper + expersq 

out2 <- lm(lwage ~ exper + expersq, data = wage1)
lwage_res = out2$residuals
  • In the third step, we regress the residuals lwage_res on the residuals educ_res 4. The coefficient of educ_res in this regression model should equal the coefficient of educ in the original regression model.
# regressing lwage_res on educ_res 

out4 <- lm(lwage_res ~ 0 + educ_res)
out4
      
      Call:
      lm(formula = lwage_res ~ 0 + educ_res)
      
      Coefficients:
      educ_res  
       0.09037
  • As we see, the coefficients are identical. The same is true for standard errors, t-values and residual variance

    • Actually, they differ a tiny bit, because the two-step procedure does not take into account that we lost degrees of freedom in the first step. But asymptotically, this is irrelevant.

Code
# Using 'modelsummary' for a pretty output

modelsummary(list("Multiple Regression"=out, "Simple Regression"=out4),
            stars = TRUE, 
            fmt = 4,
            statistic = c('statistic'),
            gof_omit = "A|L|B|F",
            align = "ldd",
            output = "flextable"
            )
Table 2.1: Applying the FWL Theorem (t-statistics in brackets)

                        Multiple Regression    Simple Regression
       (Intercept)           0.1280
                            (1.2083)
       educ                  0.0904***
                           (12.1004)
       exper                 0.0410***
                            (7.8916)
       expersq              -0.0007***
                           (-6.1639)
       educ_res                                     0.0904***
                                                  (12.1351)
       Num.Obs.            526                    526
       R2                    0.300                  0.219
       RMSE                  0.44                   0.44

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

2.5.1 FWL Theorem in matrix notation

Consider a regression model with two groups of regressors. We are especially interested in the effects of \, \mathbf X_1 on \mathbf y, i.e. on \boldsymbol{\hat\beta}_1

\mathbf{y} = \mathbf{X}_1 \boldsymbol{\hat\beta}_1 + \mathbf{X}_2 \boldsymbol{\hat\beta}_2 + \mathbf{\hat u} \tag{2.21}

  • Define the so called orthogonal projection matrix or residual maker matrix \, \mathbf M_2 as

\mathbf M_2 := \ ( \mathbf I - \underbrace{\mathbf X_2(\mathbf X_2' \mathbf X_2)^{-1}\mathbf X_2'}_{:= \ \mathbf P_2}) = (\mathbf I - \mathbf P_2) \tag{2.22}

  • Define the so called projection matrix or hat matrix \, \mathbf P_2 as

\mathbf P_2 := \ \mathbf X_2(\mathbf X_2' \mathbf X_2)^{-1}\mathbf X_2' \tag{2.23}

  • Multiplying \mathbf P_2 with an appropriately dimensioned arbitrary vector \mathbf z yields a vector of predicted values \mathbf{\hat z} of a regression of \, \mathbf z on \mathbf X_2:

\mathbf P_2 \mathbf z \ = \ \mathbf X_2 \underbrace{ (\mathbf X_2' \mathbf X_2)^{-1}\mathbf X_2' \, \mathbf z }_{\boldsymbol {\hat \delta}} \ = \ \mathbf X_2 \boldsymbol {\hat \delta} \ = \ \mathbf{\hat z}

  • Thereby \boldsymbol{\hat \delta} is the estimated regression parameter vector of the regression: \, \mathbf z = \mathbf X_2 \boldsymbol{\hat \delta} + \mathbf{\hat e}

  • Multiplying \mathbf M_2 with an appropriately dimensioned arbitrary vector \mathbf z yields the residuals \mathbf {\hat e} of a regression of \, \mathbf z on \mathbf X_2:

\mathbf M_2 \mathbf z \, = \, (\mathbf I - \mathbf P_2) \mathbf z \, = \, \mathbf z - \mathbf P_2 \mathbf z \, = \, \mathbf z - \mathbf X_2 \boldsymbol{\hat\delta} \, = \, \mathbf z - \mathbf{\hat z} \, = \, \boldsymbol{\hat e}


  • Both projection matrices are of enormous importance in econometrics.

    • Both matrices are symmetric and idempotent which means: \, \mathbf P_2 \mathbf P_2 = \mathbf P_2 and \, \mathbf M_2 \mathbf M_2 = \mathbf M_2

    • Furthermore \, \mathbf P_2, \mathbf M_2 are orthogonal: \, \mathbf P_2 \mathbf M_2 = \mathbf M_2 \mathbf P_2 = \mathbf 0
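
Both properties can be checked numerically. A small sketch, using a hypothetical \mathbf X_2 built from the wage1 data loaded earlier (intercept, exper, expersq):

# Numerical check of symmetry, idempotency and orthogonality of P2 and M2
X2 <- cbind(1, wage1$exper, wage1$expersq)
P2 <- X2 %*% solve(t(X2) %*% X2) %*% t(X2)   # projection ("hat") matrix
M2 <- diag(nrow(X2)) - P2                    # residual maker matrix

max(abs(P2 - t(P2)))       # ~ 0: symmetric
max(abs(P2 %*% P2 - P2))   # ~ 0: idempotent
max(abs(M2 %*% M2 - M2))   # ~ 0: idempotent
max(abs(P2 %*% M2))        # ~ 0: orthogonal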


Now, we pre-multiply the original multiple regression model, Equation 2.21, with \, \mathbf M_2:

\mathbf M_2 \mathbf{y} = \mathbf M_2 \mathbf{X}_1 \boldsymbol{\hat\beta}_1 + \underbrace{\mathbf M_2 \mathbf{X}_2 \boldsymbol{\hat\beta}_2}_{\mathbf 0} + \underbrace{\mathbf M_2\mathbf{\hat u}}_{\mathbf{\hat u}}

  • To see that \, \mathbf M_2 \mathbf{X}_2 \boldsymbol{\hat\beta}_2= \mathbf 0 we can simply multiply out. But from a content point of view, it is clear that regressing \mathbf X_2 on \mathbf X_2 yields zero-residuals

  • Furthermore we have \, \mathbf M_2 \mathbf{\hat u} = (\mathbf I - \mathbf P_2) \mathbf{\hat u} = \mathbf{\hat u}

    • \mathbf P_2\mathbf{\hat u} = \mathbf 0 \, because OLS residuals are always uncorrelated with regressors – \mathbf X_2'\mathbf{\hat u} = \mathbf 0 (orthogonality property of OLS). So, regressing OLS residuals on some of their regressors leads to a regression with zero fit and therefore unchanged residuals

Hence, after pre-multiplying the original regression model with \, \mathbf M_2 we have:

\mathbf M_2 \mathbf{y} = \mathbf M_2 \mathbf{X}_1 \boldsymbol{\hat\beta}_1 + \mathbf{\hat u}

  • On the left hand side we have the residuals of a regression of \, \mathbf y on \mathbf X_2: \, \mathbf M_2 \mathbf{y}

  • On the right hand side we have residuals of a regression of \, \mathbf X_1 on \mathbf X_2: \, \mathbf M_2 \mathbf X_1

Thus, regressing the left hand residuals only on the right hand residuals yields the very same estimate \boldsymbol{\hat \beta}_1 and the very same residuals \mathbf{\hat u} as the original multiple regression model. This proves the FWL Theorem.

Important Remark: If \, \mathbf X_2 only consists of the unit vector \mathbf i = (1,1, \ldots, 1)', we have: \,\mathbf M_{\mathbf i } = (\mathbf I - \frac {1}{n} \mathbf i \, \mathbf i') and \mathbf P_{\mathbf i} \mathbf X_1 = \mathbf{\bar X_1} and therefore \, \mathbf M_{\mathbf i} \mathbf X_1 = (\mathbf{X_1} - \mathbf{\bar X_1}). Hence, intercept \rightarrow demeaning
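
A tiny sketch of this last remark, illustrating that multiplying by \mathbf M_{\mathbf i} demeans a variable (here educ from the wage1 data):

# Projection on the constant: M_i x equals the demeaned x
n    <- nrow(wage1)
ones <- rep(1, n)
M_i  <- diag(n) - ones %*% t(ones) / n                           # M_i = I - (1/n) i i'
max(abs(M_i %*% wage1$educ - (wage1$educ - mean(wage1$educ))))   # ~ 0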


2.6 Algebraic properties of OLS estimates

  • Sum of residuals is zero
    \sum_{i=1}^n \hat u_i = 0
    Follows from the normal equations (Equation 1.8)

  • Orthogonality of x_j and \hat u_i - zero covariance
    \sum_{i=1}^n x_{ij} \hat u_i = 0
    Follows from the normal equations (Equation 1.9)

  • Sample averages of y and of the regressors lie on regression line (regression hyperplane)
    \bar{y}=\widehat{\beta}_{0}+\widehat{\beta}_{1} \bar{x}_{1}+\ldots+\widehat{\beta}_{k} \bar{x}_{k}
    Follows from Equation 1.10

  • Orthogonality allows a decomposition of total variation of y (see Section 1.3.6)
    SST = SSE + SSR \ \ \rightarrow

  • Goodness-of-Fit measure: R2
    R^2 := SSE/SST = 1 - SSR/SST

  • Regressing a model, as usual, with an intercept yields the very same results as when regressing all variables demeaned (including y), but without an intercept (follows from the FWL theorem, see proof above, last remark). Hence, the role of the intercept is demeaning all the variables

  • Alternative expression for R-squared: In an OLS model with an intercept, R2 is equal to the squared correlation between the actual values y and the predicted values \hat y

R^{2} := \ \dfrac{\left(\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)\left(\hat{y}_{i}-\bar{\hat{y}}\right)\right)^{2}}{\left(\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}\right) \left( \sum_{i=1}^{n} ( \hat{y}_{i}-\bar{\hat{y}} )^{2} \right)} \tag{2.24}

  • R2 never decreases when an additional regressor is added

  • This suggests using an adjusted measure of fit that accounts for the degrees of freedom, df, of the numerator and denominator (a small computational check follows after this list):
    Adjusted R2 or R bar squared

\bar{R}^{2} := \ 1-\dfrac{SSR/(n-k-1)}{SST/(n-1)}=1-\dfrac{n-1}{n-k-1}\left(1-R^{2}\right)

  • This measure can decrease when additional variables do not contribute to explaining y in a relevant way (it increases only if the added variable's t-statistic is greater than one in absolute value)
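
A small computational check of R^2 and the adjusted R^2, using model2 (lwage on educ and exper) from the wage example:

# R-squared and adjusted R-squared by hand for model2
u_hat <- residuals(model2)
y     <- wage1$lwage
SSR <- sum(u_hat^2)
SST <- sum((y - mean(y))^2)
n <- length(y); k <- 2

c(R2     = 1 - SSR/SST,
  R2_adj = 1 - (SSR/(n - k - 1)) / (SST/(n - 1)))
# compare with summary(model2)$r.squared and summary(model2)$adj.r.squared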

Orthogonality property of OLS

Proposition: The explanatory variables are orthogonal to the OLS residuals:

\mathbf X^{\prime} \hat{\mathbf{u}} \ = \ \mathbf 0

Proof:

\hat{\mathbf{u}} \ = \ (\mathbf{y}-\mathbf{X} \hat{\boldsymbol{\beta}}) \ = \ (\mathbf{y}-\mathbf{X} \underbrace{\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \, \mathbf{X}^{\prime} \mathbf{y}}_{\hat{\boldsymbol{\beta}}}) \tag{2.25}

Multiplying Equation 2.25 by \ \mathbf X'

\Rightarrow \ \ \mathbf{X}' \hat{\mathbf{u}} \ = \ (\mathbf{X}^{\prime} \mathbf{y}-\underbrace{\mathbf{X}^{\prime} \mathbf{X}\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1}}_{\mathbf I} \, \mathbf{X}^{\prime} \mathbf{y}) \ = \ (\mathbf{X}^{\prime} \mathbf{y}-\mathbf{X}^{\prime} \mathbf{y}) \ = \ \mathbf{0}

Corollary:

\mathbf i' \mathbf {\hat u} \ = \ \sum_{i=1}^{n} \hat{u}_{i} \ =\ 0

as the first row of \mathbf X' is the transposed ones-vector \mathbf i' = [1, 1, \ldots, 1].
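
This orthogonality property is easy to check numerically for any fitted model; a one-line sketch with model2 from above:

# X'u_hat should be zero up to floating point error
max(abs(t(model.matrix(model2)) %*% residuals(model2)))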


2.7 Assumptions of the multiple regression model

Here, we are using the notation from Wooldridge (2019)

  • Assumption 1 (MLR.1): The model is linear in parameters

  • Assumption 2 (MLR.2): Observations (y_i, x_{i,j}) are the result of random sampling of size n from the population model. (Typically, this assumption does not hold for time series, but it can be relaxed, provided the dependencies between successive observations are not too strong and decline fast enough)

  • Assumption 3 (MLR.3): No perfect collinearity

    • Remarks on MLR.3: This assumption only rules out perfect collinearity/correlation between explanatory variables. Constant variables are also ruled out (collinear with intercept)

    • An example for perfect collinearity; two candidates and the number of votes depend on the shares of campaign expenditures:

    vote \, A = \beta_0 + \beta_1 share A + \beta_2 share B + u

    • The shares of candidates A and B sum to one and are therefore perfectly collinear with the intercept

    • Imperfect correlation is allowed, but leads to the problem of multicollinearity which regularly causes imprecise estimates and will be discussed later

  • Assumption 4 (MLR.4): Exogeneity of explanatory variables – zero conditional mean or conditional mean independence assumption

    E(u_i \mid x_{1}, x_{2} \ldots , x_{k}) = E(u_i) = 0

    • The values of the explanatory variables must not contain any information about the mean of the unobserved factors. In this case \frac {\partial E(u \mid x)}{\partial x}=0

    • Discussion of the zero conditional mean assumption

      • In a multiple regression model, the zero conditional mean assumption is much more likely to hold because fewer things end up in the error term

      • Explanatory variables that are correlated with the error term are called endogenous; endogeneity is a violation of assumption MLR.4

      • Explanatory variables that are uncorrelated with the error term are called exogenous; MLR.4 holds if all explanatory variables are exogenous

      • Exogeneity is the key assumption for a causal interpretation of the partial effects measured by \beta and for unbiasedness of the OLS estimates.
        (Remark: MLR.4 also generally leads to consistent OLS estimates, meaning that with an infinite sample size the parameters can be estimated “infinitely” precise)

    • Assumption 4’ (MLR.4’): Weak Exogeneity of explanatory variables

      E(u_i \mid x_{i,1}, x_{i,2} \ldots , x_{i,k}) = E(u_i) = 0

      • This is a weaker variant of MLR.4 and only this is generally necessary for consistent estimates if the error term is iid. This variant merely requires that u_i and x_{i,k} are uncorrelated within the same observation, whereas MLR.4 requires that a particular u_i is uncorrelated with x_{k}, which means with every observation of x_{k}
  • Assumption 5 (MLR.5): Homoscedasticity of u_i

    Var(u_i \mid x_{1}, x_{2} \ldots , x_{k}) = \sigma^2

    • The value of the explanatory variables must not contain any information about the variance of the unobserved factors

    • A violation of this assumption leads to inefficient estimates and incorrect formulas for statistical tests (both problems can be handled within OLS via appropriately transformed data or heteroscedasticity consistent estimates of the covariance matrix)

  • Assumption 6 (MLR.6): Conditional normality of u_i

    u_i \mid (x_{1}, x_{2} \ldots , x_{k}) \ \sim \ N(0, \sigma^2)

    • This assumption is needed for statistical tests in small samples

Assumption 4 (MLR.4): Exogeneity of \, \mathbf X \, (zero conditional mean assumption)

E({\mathbf{u}} \ | \ \mathbf{X}) = 0 \tag{2.26}

  • This is implied by E(u_i \ | \ \mathbf{x}_i) = 0 and MLR.2 (random sampling)

  • Assumption 4’ (MLR.4’)

    E(u_i \ | \ \mathbf{x}_i) = 0 \tag{2.27}

    (without random sampling), is weaker than Equation 2.26 and is called weak exogeneity – it requires only non-correlation between \mathbf{x}_i and u_i within a particular observation i and is necessary for consistency, whereas the stronger MLR.4 is necessary for unbiasedness


Assumption 5 (MLR.5): Homoscedasticity of \mathbf u. This assumption, together with the assumption of no autocorrelation in the errors (random sampling), can be stated in a very compact form by assuming a particular covariance matrix

\operatorname {Var} (\mathbf u \ | \ \mathbf X) = \sigma^2 \mathbf I \tag{2.28}

  • This is implied by \operatorname {Var} (u_i \mid \mathbf{x}_i) = \sigma^2 and MLR.2 (random sampling)

Assumption 5 (MLR.5) can be stated with the help of the concept of a covariance matrix

  • The covariance matrix of \mathbf u is defined by:

\operatorname {Var} (\mathbf u \ | \ \mathbf X) \ := \ E\left[ \left( \mathbf{u}-E(\mathbf{u} \ | \ \mathbf{X}) \right) \left( \mathbf{u}-E(\mathbf{u} \ | \ \mathbf{X}) \right)^{\prime} \right] \ (\text{ by def. }) \ =

E\left[(\mathbf{u}-\mathbf{0}) \cdot(\mathbf{u}-\mathbf{0})^{\prime} \ | \ \mathbf{X} \right] \ (\text { by assumpt. } E(\mathbf{u} \ | \ \mathbf{X})=\mathbf{0}) \ =

E\left[\mathbf{u} \mathbf{u}' \ | \ \mathbf X \right] \ = \ E \left( \left[\begin{array}{cccc} u_{1}^{2} & u_{1} u_{2} & \cdots & u_{1} u_{n} \\ u_{2} u_{1} & u_{2}^{2} & \cdots & u_{2} u_{n} \\ \vdots & \vdots & \ddots & \vdots \\ u_{n} u_{1} & u_{n} u_{2} & \cdots & u_{n}^{2} \end{array}\right] \ \Bigg| \ \mathbf X \right) \ =

\left[ \begin{array}{cccc} Var\left(u_{1}\right) & {Cov}\left(u_{1}, u_{2}\right) & \cdots & {Cov}\left(u_{1}, u_{n}\right) \\ {Cov}\left(u_{2}, u_{1}\right) & Var\left(u_{2}\right) & \cdots & {Cov}\left(u_{2}, u_{n}\right) \\ \vdots & \vdots & \ddots & \vdots \\ {Cov}\left(u_{n}, u_{1}\right) & {Cov}\left(u_{n}, u_{2}\right) & \cdots & Var\left(u_{n}\right) \end{array} \right]_{ \ \mid\ \mathbf X}

  • Imposing the assumption of random sampling (or no autocorrelation in the error term) implies that \operatorname {Cov} \left( u_i,u_j \mid \mathbf X \right) = 0, \ \forall i \neq j

  • Imposing homoscedasticity implies that \operatorname {Var} \left( u_i \, | \, \mathbf X \right) = \sigma^2, \ \forall i

  • Hence, we have:

\left[\begin{array}{cccc} \sigma^{2} & 0 & \cdots & 0 \\ 0 & \sigma^{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^{2} \end{array}\right] \ = \ \sigma^2 \, \mathbf I

\overset {MLR.5} {\Rightarrow} \ \ \operatorname {Var} (\mathbf u \ | \ \mathbf X) \ = \ \sigma^2 \ \mathbf I


2.8 Statistical properties of OLS estimates

2.8.1 Unbiasedness of OLS

Theorem 2.1 (Unbiasedness of OLS) \text{ }

Under MLR.1 to MLR.4 OLS parameter estimates are unbiased:

E(\hat \beta_j) \ = \ \beta_j, \ \ \ \forall j
Further, under MLR.1 to MLR.5:

E(\hat \sigma^2) \ = \ E \left[ \frac{1}{n-k-1} \sum_{i=1}^n \hat u_i^2 \right] \ = \ \sigma^2

The first step is a very important alternative representation for \, {\hat{\boldsymbol{\beta}}}, considering the “true” model: \, \mathbf y = \mathbf X \boldsymbol \beta + \mathbf u

\hat{\boldsymbol{\beta}} \ = \ (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}' \mathbf{y} \ = \ (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}^{\prime}(\underbrace{\mathbf{X} \boldsymbol{\beta}+\mathbf{u}}_{\mathbf y}) \ =

=\ \underbrace{(\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{X} }_{\mathbf I} \ \boldsymbol{\beta}+ \left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{X}' \mathbf{u} \ \ \ \Rightarrow

\hat{\boldsymbol{\beta}} \ = \ \boldsymbol{\beta}+\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{X}' \mathbf{u}, \ \text{ or } \tag{2.29}

(\hat{\boldsymbol{\beta}}-\boldsymbol{\beta}) \ = \ \left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{X}' \mathbf{u} \tag{2.30}


Proof of \, E({\boldsymbol{\hat \beta}} ) = \boldsymbol{\beta}.

Using representation from Equation 2.29 and using assumption MLR.4 (exogeneity of \mathbf X, i.e., E({\mathbf{u}} \, | \, \mathbf{X}) = 0), we have:

E({\boldsymbol{\hat \beta}} \mid \mathbf{X}) = E \left( \boldsymbol{\beta} + \left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{X}' \mathbf{u} \mid \mathbf{X}\right) \ =

\boldsymbol{\beta}+\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{X}^{\prime} \underbrace{E(\mathbf{u} \mid \mathbf{X})}_{0 \text { by assumption }} = \ \boldsymbol{\beta}

\Rightarrow \ \ E ( \hat{\boldsymbol{\beta}} ) \ = \ E_x \, \underbrace{ \left( E(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) \right) }_{\boldsymbol \beta} \ = \ \boldsymbol{\beta}

as the true \boldsymbol{\beta} is a constant and therefore obviously independent of \mathbf{X}.

q.e.d.


Proof of \ E(\hat \sigma^2) = \sigma^2

Given the definition of the orthogonal projection matrix (residual maker matrix) (Equation C.10),

\mathbf M_{_\mathbf X} := \ ( \mathbf I - {\mathbf X(\mathbf X' \mathbf X)^{-1}\mathbf X'})

we have: \, \mathbf {\hat u} = \mathbf M_{_\mathbf X} \mathbf y.

And furthermore, \, \mathbf {\hat u} = \mathbf M_{_\mathbf X} \mathbf u\,:

{\mathbf{\hat u}} = \mathbf M_{_\mathbf X} \mathbf{y} = \mathbf M_{_\mathbf X} (\underbrace{\mathbf X \boldsymbol{\beta}+\mathbf{u}}_{\mathbf{y}}) = \underbrace{\mathbf M_{_\mathbf X} \mathbf{X} \boldsymbol{\beta}}_{\mathbf{0}} + \mathbf M_{_\mathbf X} \mathbf{u} = \mathbf M_{_\mathbf X} \mathbf{u}

Now, Theorem 2.1 says (don’t forget that \, \mathbf M_{_\mathbf X} is symmetric and idempotent):

E(\hat \sigma^2) \ = \ E \left[ \frac{1}{n-k-1} \sum_{i=1}^n \hat u_i^2 \right] \ = \ \left[ \frac{1}{n-k-1} \right] \ E({\mathbf{\hat u}}' {\mathbf{\hat u}}) \ =

\left[ \frac{1}{n-k-1} \right] \ E({\mathbf{u}}' \mathbf M_{_\mathbf X}{\mathbf{u}}) \ = \ \sigma^2

We proceed with the conditional expectation:

E(\hat \sigma^2 \, | \, \mathbf X) \ = \ (n-k-1)^{-1} E({\mathbf{u}}' \mathbf M_{_\mathbf X}{\mathbf{u}} \, | \, \mathbf X) \ =

(n-k-1)^{-1} E( tr ({\mathbf{u}}' \mathbf M_{_\mathbf X}{\mathbf{u}}) \, | \, \mathbf X) \ =

because {\mathbf{u}}' \mathbf M_{_\mathbf X}{\mathbf{u}} is a scalar and therefore equal to its trace.

(n-k-1)^{-1} \, tr \left( E ({\mathbf{u}}' \mathbf M_{_\mathbf X}{\mathbf{u}} \mid \mathbf X) \right) \ =

because both E and tr are linear operators and therefore are exchangeable.

(n-k-1)^{-1} \, tr \left( E ({\mathbf{u}} {\mathbf{u}}' \, \mathbf M_{_\mathbf X} \mid \mathbf X ) \right) \ =

because of Equation B.6.

(n-k-1)^{-1} \, tr \left( \underbrace {E ({\mathbf{u}} {\mathbf{u}}' \mid \mathbf X )}_{\sigma^2 \mathbf I} \, \mathbf M_{_\mathbf X} \right) \ =

because of MLR.5.

(n-k-1)^{-1} \sigma^2 \, tr( \mathbf M_{_\mathbf X} ) \ =

(n-k-1)^{-1} \sigma^2 \, tr( \mathbf I - {\mathbf X(\mathbf X' \mathbf X)^{-1}\mathbf X'} ) \ =

(n-k-1)^{-1} \sigma^2 \, \left ( tr( \mathbf I ) - tr ({\mathbf X(\mathbf X' \mathbf X)^{-1}\mathbf X'} ) \right ) \ =

because of Equation B.9.

(n-k-1)^{-1} \sigma^2 \, \left ( tr( \mathbf I ) - tr ({ \underbrace { \mathbf X' \mathbf X(\mathbf X' \mathbf X)^{-1}}_{\mathbf I_{k+1}} } ) \right ) \ =

because of Equation B.6.

(n-k-1)^{-1} \sigma^2 \, \left ( tr( \mathbf I_n ) - tr ( \mathbf I_{(k+1)} ) \right ) \ =

because of Equation B.9 and the dimension of (\mathbf X' \mathbf X) is (k+1) \times (k+1) and the dimension of \, \mathbf M_{_\mathbf X} is n \times n

(n-k-1)^{-1} \sigma^2 \, \left ( n - (k+1) \right ) \ =

\sigma^2

q.e.d.
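
The trace argument can be checked numerically; a small sketch with model2 from the wage example (n = 526, k = 2):

# tr(M_X) = n - k - 1
X <- model.matrix(model2)
M_X <- diag(nrow(X)) - X %*% solve(t(X) %*% X) %*% t(X)
sum(diag(M_X))    # ~ 523 = 526 - 2 - 1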

  • This means that in “very many” repeated samples the average of the OLS estimates equals the true values

  • As unbiasedness is an average property in repeated samples, in a given particular sample, the estimates may still be far away from the true values

  • The key requirement for that property is the zero conditional mean assumption (MLR.4). Whenever E(u_i \mid x_{1}, x_{2} \ldots , x_{k}) \, \neq \, 0, OLS estimates are biased, i.e., on average, they are false

  • The additional key requirement for the unbiasedness of \hat \sigma^2 is MLR.5


2.8.2 Variance of OLS

Theorem 2.2 (Variance of OLS) \text{ }

Under MLR.1 to MLR.5 the variance of the OLS estimates is:

{Var(\hat \beta_j \mid x_{1}, \ldots , x_{k})} \ = \ \dfrac{\sigma^2}{ \underbrace{SST_{j}}_{\sum_{i=1}^n (x_{i,j} - \bar x_j)^2} (1-R_j^2) } \ = \ \frac {1}{n} \dfrac {\sigma^2}{s_{x_j}^2} \dfrac {1}{(1-R_j^2)} \tag{2.31}

The proof of Equation 2.31 for a slope parameter is quite straightforward, although a little tedious.
From Theorem 1.2 we know that for the simple regression model we have:

\operatorname{Var}(\hat \beta_1|x_1, \ldots, x_n) \ = \ \dfrac {\sigma^2}{SST_x} \tag{2.32}

From Equation 2.17 and Equation 2.16 respectively, and the discussion of the FWL theorem, we know that a multiple regression model can be reduced to a simple one for every particular parameter we want to estimate, for instance for \beta_j.

Hence, the simple variance formula Equation 2.32 also applies to any parameter j of a multiple regression.

However, the x-variable in regression Equation 2.17 (which was \hat r_1) itself consists of the residuals of a first stage regression of x_j on [1, x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_k] and therefore SST_x from Equation 2.32 above is equal to the SSR_j of that first stage regression.

SST_x \ = \ \sum_{i=1}^n (\hat r_i - \bar {\hat r})^2 \ = \ SSR_j

Furthermore, we know from Equation 1.14 that SST_j = SSE_j + SSR_j, which implies:

SSR_j = SST_j - SSE_j \tag{2.33}

whereby SST_j = \sum_{i=1}^n (x_{i,j} - \bar x_j)^2.

From the definition of R^2, Equation 1.15, we have for the R^2_j of the first stage regression, R^2_j = \frac {SSE_j}{SST_j}, which implies:

SSE_j = SST_j \cdot R^2_j

Inserting this into Equation 2.33 yields:

SSR_j \ = \ SST_j - SST_j \cdot R^2_j \ = \ SST_j(1-R^2_j)

So, we can substitute SST_j(1-R^2_j) for SST_x in Equation 2.32 and arrive at Equation 2.31.

q.e.d.


{Var(\hat \beta_j \mid x_{1}, x_{2}, \ldots , x_{k})} from Equation 2.31 is the j^{th} diagonal element of the covariance matrix of \boldsymbol {\hat \beta}. So, let's calculate that covariance matrix.

The covariance matrix of {\mathbf {\hat \beta}}

Under the assumption MLR.5

\operatorname{Var}(\mathbf u \mid \mathbf X) = E (\mathbf{u u}' \mid \mathbf{X}) = \sigma^2 \mathbf I \, ,

unbiasedness of \, \boldsymbol {\hat \beta}, Equation C.4 and the definition of a covariance matrix, Equation C.6, we have:

\operatorname{Var}(\boldsymbol{\hat \beta} \mid \mathbf{X}) := \ E\left[(\boldsymbol{\hat \beta}-E(\boldsymbol{\hat \beta}))(\boldsymbol{\hat \beta}-E(\boldsymbol{\hat \beta}))^{\prime} \mid \mathbf{X}\right]\ =

E\left[(\boldsymbol{\hat \beta}-\boldsymbol{\beta})(\boldsymbol{\hat \beta}-\boldsymbol{ \beta})^{\prime} \mid \mathbf{X}\right] \ =

E [ (\underbrace{(\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{u}}_{(\hat{\boldsymbol\beta}-\boldsymbol\beta)} )(\underbrace{(\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}' \mathbf{u}}_{(\hat{\boldsymbol\beta}-\boldsymbol\beta)})^{\prime} \mid \mathbf{X}] \ =

E\left[ (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}' \mathbf{u} \mathbf{u}^{\prime} \mathbf{X}(\mathbf{X}' \mathbf{X})^{-1} \mid \mathbf{X} \right] \ =

\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{X}^{\prime} \underbrace{ \ E\left(\mathbf{u u}' \mid \mathbf{X}\right)}_{\sigma^{2} \mathbf I} \ \mathbf{X}(\mathbf{X}' \mathbf{X})^{-1}

\Rightarrow \ \ \operatorname{Var}( \boldsymbol{\hat\beta} \mid \mathbf X) \ = \ \sigma^{2}(\mathbf{X}' \mathbf{X})^{-1} \tag{2.34}


Alternatively, applying Equation C.3 and the variance formula Equation C.7, we directly get the same result:

\operatorname{Var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) \ = \ \operatorname{Var} [ \underbrace{ \boldsymbol{\beta} +\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{X}^{\prime} \mathbf{u}}_{ \hat{\boldsymbol{\beta}}} \mid \mathbf{X}] \ =

\underbrace{ \left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{X}^{\prime}}_{\mathbf A} \ \underbrace{ \operatorname{Var}(\mathbf{u} \mid \mathbf{X})}_{\mathbf \Sigma} \ \underbrace{\mathbf{X}\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1}}_{\mathbf A'} \ =

\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \mathbf{X}^{\prime} \ \sigma^{2} \mathbf{I} \ \mathbf{X}\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1} \ =

\sigma^{2}\left(\mathbf{X}^{\prime} \mathbf{X}\right)^{-1}
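Equation 2.34 is easy to verify in R. The following minimal sketch (simulated data, purely illustrative) computes \hat \sigma^2 (\mathbf X' \mathbf X)^{-1} by hand and compares it with the covariance matrix that vcov() reports for an lm() fit:

Code
# comparing sigma^2_hat * (X'X)^{-1} with vcov() of an lm() fit; simulated data
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

X      <- cbind(1, x1, x2)                     # design matrix incl. intercept
sig2   <- sum(resid(fit)^2) / (n - 3)          # sigma^2_hat = SSR/(n-k-1), Equation 2.35
V_hand <- sig2 * solve(t(X) %*% X)             # Equation 2.34 with sigma^2_hat plugged in
all.equal(unname(V_hand), unname(vcov(fit)))   # should be TRUE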

  • Besides MLR.4, the key requirement for this theorem is MLR.5, homoscedasticity. If this assumption is violated, i.e., the variances of u_i depend on the explanatory variables (the x-es), the formula above is simply false (for this case, there exists a modified formula)

Factors determining \operatorname {Var}(\hat \beta_j \, | \, \mathbf x)

According to Equation 2.31, the variance of \hat \beta_j

  • increases with \sigma^2, the variance of the error term

    • a high error variance increases the sampling variance of \hat \beta_j because there is more “noise” in the equation; this necessarily makes estimates more imprecise

    • Remark: The error variance itself does not decrease with sample size

  • decreases with SST_{j}, the total sum of squared deviations \sum_{i=1}^n (x_{i,j}-\bar x_j)^2

    • generally, more sample variation in the x-es leads to more precise estimates

    • SST_{j} automatically increases with the sample size n. Increasing the sample size is thus a way to get more precise estimates

      • These two factors are identical to the term \frac {1}{n} \frac {\sigma^2}{s_{x_{(j)}}^2} we already had with the simple regression model – compare Theorem 1.2.
  • increases with R_j^2, the R-squared of a regression of x_j on all the other right hand variables. This is the new aspect of a multiple regression model

    • In particular, this is a measure of the degree of multicollinearity between the x-es. The higher the correlation of x_j with the other x-es, the lower the independent variation of x_j and the less additional information is contained in the variations of x_j

    • Multicollinearity may be measured by the variance inflation factor: VIF_j := 1 / (1-R_j^2). As a (quite arbitrary) rule of thumb, the variance inflation factor should not be larger than 10 (a small computational sketch follows below)

Thus, the variance of \hat \beta_j in a multiple regression model corresponds to the variance of the simple regression model times the variance inflation factor, see Equation 2.31
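A minimal sketch of how the variance inflation factor can be computed by hand (simulated, correlated regressors; values purely illustrative): regress x_j on the remaining regressors and plug the resulting R_j^2 into 1/(1-R_j^2).

Code
# manual VIF: regress x_j on the other regressors, VIF_j = 1/(1 - R_j^2)
set.seed(1)
n  <- 500
z  <- rnorm(n)
x1 <- z + rnorm(n); x2 <- z + rnorm(n); x3 <- rnorm(n)  # x1 and x2 are correlated

R2_1  <- summary(lm(x1 ~ x2 + x3))$r.squared   # R_j^2 of the auxiliary regression
VIF_1 <- 1 / (1 - R2_1)
VIF_1                                          # far below the rule-of-thumb threshold of 10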


Variance estimation

According to Theorem 2.1, an unbiased estimator of \sigma^2 is given by

\hat \sigma^2 \ = \ \frac{1}{n-k-1} \sum_{i=1}^n \hat u_i^2 \ = \ \dfrac{SSR}{df} \tag{2.35}

Given Equation 2.31, an unbiased estimator for the variance of \hat \beta_j is

\widehat{Var(\hat \beta_j)} \ = \ \dfrac{\hat \sigma^2}{SST_{j} (1-R_j^2)} \tag{2.36}

And for the standard error of \hat \beta_j

se(\hat \beta_j) = \sqrt{\widehat{Var(\hat \beta_j)}} \ = \ \dfrac{\hat \sigma}{\sqrt{SST_{j} (1-R_j^2)}} \ = \ \dfrac{ \overbrace{S.\! E.}^{\hat \sigma \ :=}}{\sqrt{SST_{j} (1-R_j^2)}} \tag{2.37}
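Equation 2.36 and Equation 2.37 can be reproduced by hand from an lm() fit. The following minimal sketch (simulated data, purely illustrative) computes se(\hat \beta_1) from \hat \sigma, SST_1 and R_1^2 and compares it with the standard error reported by summary():

Code
# reproducing se(beta_1_hat) = sigma_hat / sqrt(SST_1 (1 - R_1^2)) by hand
set.seed(1)
n  <- 300
x1 <- rnorm(n); x2 <- 0.5*x1 + rnorm(n)        # some multicollinearity between x1 and x2
y  <- 1 + x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

sig_hat <- sqrt(sum(resid(fit)^2) / (n - 3))   # sigma_hat, Equation 2.35
SST1    <- sum((x1 - mean(x1))^2)              # total sum of squares of x1
R2_1    <- summary(lm(x1 ~ x2))$r.squared      # R^2 of the first stage regression of x1
se_hand <- sig_hat / sqrt(SST1 * (1 - R2_1))   # Equation 2.37

c(by_hand = se_hand, from_lm = summary(fit)$coefficients["x1", "Std. Error"])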


  • We assume the true model to be: \, y = 1 + \beta_1 x_1 + \beta_2 x_2 + u

    • We assume the true parameters to be: \, \beta_1 = 1 \, and \, \beta_2 = 1

    • We generate random vectors for x_1, x_2 and u with 100 observations each and calculate y according to the model above

    • Thereby, the standard deviation of u is 1, of x_1 is 2 and of x_2 is 4

    • The variables x_1 and x_2 are not correlated; hence, there is no multicollinearity

  • We estimate the model above with the generated variables y, x_1 and x_2 by OLS (with lm()) and store the values of \hat \beta_1 and \hat \beta_2. This estimation procedure is repeated 10000 times

    • Finally, we calculate the means and standard deviations of the stored 10000 \hat \beta_1 and \hat \beta_2

    • Thus, we expect

      • mean(\hat \beta_1) = 1
      • mean(\hat \beta_2) = 1
    • And furthermore, according to Equation 2.31, we expect (note, SST_j \approx (n-k-1) \times \sigma_j^2)

      • se(\hat \beta_1) = \sqrt{(\frac {1}{97 \times 2^2})} = 0.0508
      • se(\hat \beta_2) = \sqrt{(\frac {1}{97 \times 4^2})} = 0.0254
Code
n=100        # number of observations
repli=10000  # number of replications
distri=1     # which distribution for u is used: 1 = normal distribution 
  
B1 = rep(NA, repli)  # initialize B1 for storing estimated b1
B2 = rep(NA, repli)  # initialize B2 for storing estimated b2

for (i in 1:repli) { 
  # u is normal distributed
   if (distri == 1) {
      u  = rnorm(n, mean = 0, sd = 1) # random normal vector, with E(u)=0
      x1 = rnorm(n, mean = 1, sd = 2) # random normal vector, with E(x1)=1
      x2 = rnorm(n, mean = 2, sd = 4) # random normal vector, with E(x2)=2
      y = 1 + x1 + x2 + u
   }
  # u is uniformly distributed
   if (distri == 2) {
      u = runif(n, min = -2, max = 2) # random uniform vector, with E(u)=0
      x1 = rnorm(n, mean = 1, sd = 2) # random normal vector, with E(x1)=1
      x2 = rnorm(n, mean = 2, sd = 4) # random normal vector, with E(x2)=2
      y = 1 + x1 + x2 + u
   }
  
   model <- lm(y ~ x1 + x2)
   b = coef(model)
  
   B1[i] = b[2]  # storing estimated b1
   B2[i] = b[3]  # storing estimated b2
}
# end loop

# results
mB1 = round(mean(B1), digits = 4)
mB2 = round(mean(B2), digits = 4)

sd1 = round(sd(B1), digits = 4)
sd2 = round(sd(B2), digits = 4)

# theoretical se
sd1th = sqrt( 1 / ( (n-3)*2^2 ) )
sd2th = sqrt( 1 / ( (n-3)*4^2 ) )

hist(B1, breaks = 25, freq = FALSE, xlim=c(0.8, 1.2), 
     main = paste0( "mean(beta1) = ", mB1, ",   sd(beta1) = ", sd1) )
curve(dnorm(x, mean = 1 ,sd=sd1th), add = TRUE, col = "red") 


hist(B2, breaks = 25, freq = FALSE, xlim=c(0.8, 1.2),
     main = paste0( "mean(beta2) = ", mB2, ",   sd(beta2) = ", sd2) )
curve(dnorm(x, mean = 1 ,sd=sd2th), add = TRUE, col = "red") 

(a) Theoretical and sample distribution of \hat \beta_1. Standard deviation of x_1 is 2

(b) Theoretical and sample distribution of \hat \beta_2. Standard deviation of x_2 is 4

Figure 2.2: Distribution of estimated regression parameters, n = 100, repli = 10000


2.8.3 The Gauss-Markov Theorem

Theorem 2.3 (Gauss-Markov Theorem) \text{ }

Under assumptions MLR.1 - MLR.5, the OLS estimators are the best linear unbiased estimators (BLUEs) of the regression coefficients, i.e.

Var(\hat \beta_j) \ \leq \ Var( \tilde \beta_j), \ \ \forall \ \, \tilde \beta_j = \sum_{i=1}^n c_{ij}y_i \ \text{ for which } \ E(\tilde \beta_j) = \beta_j

Our model once again:

\mathbf y = \mathbf X \boldsymbol \beta + \mathbf u

The OLS estimator is:

\hat{\boldsymbol{\beta}} \ = \ \underbrace {(\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}'}_{:= \mathbf A} \mathbf{y} \ = \ \mathbf A \mathbf{y}

Suppose we have another linear and unbiased estimator \tilde{\boldsymbol{\beta}}:

\tilde{\boldsymbol{\beta}} \ = \ \mathbf C \mathbf{y} \ \ \Rightarrow

\tilde{\boldsymbol{\beta}} \ = \ \mathbf C \underbrace {(\mathbf X \boldsymbol \beta + \mathbf u)}_{\mathbf y} \ = \ \mathbf C \mathbf X \boldsymbol \beta + \mathbf C \mathbf u \tag{2.38}

\tilde{\boldsymbol{\beta}} should be unbiased (using MLR.4):

E(\tilde{\boldsymbol{\beta}} \mid \mathbf X) \ = \ E(\mathbf C \mathbf X \boldsymbol \beta + \mathbf C \mathbf u \mid \mathbf X) \ = \ \mathbf C \mathbf X \, {\boldsymbol{\beta}} + \mathbf C \, \underbrace {E(\mathbf u \mid \mathbf X)}_{\mathbf 0} \ = \ \mathbf C \mathbf X \, \boldsymbol{\beta}

Hence, for unbiasedness of \tilde{\boldsymbol{\beta}} we must have:

\mathbf C \mathbf X = \mathbf I \tag{2.39}

Furthermore, from Equation 2.38, Equation 2.39 and Equation C.7 (using MLR.5) we have

\operatorname {Var}(\tilde{\boldsymbol{\beta}}) \ = \ \sigma^2 \mathbf C \mathbf C'

Define the difference of the covariance matrices:

\mathbf D \ = \ [\operatorname {Var}(\tilde{\boldsymbol{\beta}}) - \operatorname {Var}(\hat {\boldsymbol{\beta}})] \ = \ \sigma^2 [\mathbf C \mathbf C' - (\mathbf{X}' \mathbf{X})^{-1}]

But this equals to:

\mathbf D \ = \ \sigma^2 \mathbf C[\underbrace {\mathbf I - \mathbf X(\mathbf{X}' \mathbf{X})^{-1}\mathbf X'}_{\mathbf M}]\mathbf C'

because of Equation 2.39: \, \mathbf C \mathbf X = \mathbf I = \mathbf X' \mathbf C', rule Equation B.3.

However, the middle part of \mathbf D is equal to the orthogonal projection matrix \mathbf M, see Equation C.10.

  • We know that \mathbf M is idempotent and therefore positive semi definite, p.s.d. (see Appendix B and C)

  • Thus, the covariance difference matrix \mathbf D is p.s.d. as well (Appendix B), which means that every element of the main diagonal is \geq 0.

Hence, \operatorname {Var} (\tilde \beta_j) \geq \operatorname {Var} (\hat \beta_j), \ \forall j.

q.e.d.


This means that within the class of unbiased and linear (in y_i) estimators, OLS is the one with the lowest variance (with the most precision). We say, OLS is efficient.

  • The limitation to estimators linear in y_i is somewhat arbitrary. However, we have two important enhancements:

    • If we are willing to assume normally distributed errors, MLR.6, OLS is equivalent to Maximum Likelihood estimation (MLE), which is generally asymptotically efficient in the class of consistent and normally distributed estimators

    • An even stronger enhancement is Theorem 4.5 in Hansen (2022). This shows that the restriction to linear estimators in Theorem 2.3 can be completely dropped. The proof of this extended version of the Gauss-Markov Theorem is far beyond the scope of this book, see Hansen (2022), Section 4.26
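To illustrate Theorem 2.3 with a small simulation of my own (not from the text), we can compare OLS with another linear unbiased estimator, for instance a weighted least squares estimator \tilde{\boldsymbol\beta} = (\mathbf X' \mathbf W \mathbf X)^{-1}\mathbf X' \mathbf W \mathbf y with arbitrary fixed positive weights. It satisfies \mathbf C \mathbf X = \mathbf I and is therefore unbiased, but under homoskedasticity its sampling variance should exceed that of OLS:

Code
# Gauss-Markov illustration: OLS vs. an arbitrary linear unbiased estimator (WLS with "wrong" weights)
set.seed(1)
n <- 100; repli <- 5000
x <- rnorm(n); X <- cbind(1, x)            # fixed regressors across replications
W <- diag(runif(n, 0.2, 5))                # arbitrary fixed positive weights

b_ols <- b_wls <- rep(NA, repli)
for (i in 1:repli) {
  y <- 1 + x + rnorm(n)                                      # homoskedastic errors, true slope = 1
  b_ols[i] <- solve(t(X) %*% X, t(X) %*% y)[2]               # OLS slope
  b_wls[i] <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)[2]   # WLS slope
}
c(mean_ols = mean(b_ols), mean_wls = mean(b_wls))  # both approx. 1: unbiased
c(sd_ols = sd(b_ols), sd_wls = sd(b_wls))          # sd_wls typically exceeds sd_ols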


2.9 Misspecifications

2.9.1 Irrelevant variables

Consider the true model

y \ = \ \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u

But we estimate

y \ = \ \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \textcolor {red} {\beta_3 x_3} + v

  • This case, which is called overspecification, poses no problem regarding the unbiasedness, because MLR.4 is not violated through the inclusion of irrelevant variables

  • Hence, OLS estimates are still unbiased and deliver on average the true values

  • In particular, as \beta_3 = 0, we will get \hat \beta_3 = 0 on average

  • However, inclusion of irrelevant variables leads to less precise estimates, as irrelevant variables are often correlated with the relevant ones in actual samples

    • So, the variance of \hat \beta_j increases because of a more pronounced multicollinearity problem – R_j^2 increases in Equation 2.31 (a small simulation sketch follows below)
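A small simulation sketch of my own (not from the text) illustrates both points: including an irrelevant regressor leaves the slope estimate unbiased, but when the irrelevant variable is correlated with the relevant one, the variance of the estimate increases:

Code
# overspecification: x2 is irrelevant (beta_2 = 0) but correlated with x1
set.seed(1)
n <- 100; repli <- 5000
b1_true <- b1_over <- rep(NA, repli)
for (i in 1:repli) {
  x1 <- rnorm(n)
  x2 <- 0.9*x1 + rnorm(n, sd = 0.5)          # irrelevant but strongly correlated with x1
  y  <- 1 + x1 + rnorm(n)                    # true model contains only x1
  b1_true[i] <- coef(lm(y ~ x1))[2]          # correctly specified
  b1_over[i] <- coef(lm(y ~ x1 + x2))[2]     # overspecified
}
c(mean_true = mean(b1_true), mean_over = mean(b1_over))  # both approx. 1: unbiased
c(sd_true = sd(b1_true), sd_over = sd(b1_over))          # sd_over is clearly larger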

2.9.2 Omitted variables

Consider the true model

y \ = \ \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \textcolor {green} {\beta_3 x_3} + u

However, we estimate the following misspecified model, omitting the relevant variable x_3 and denoting the coefficients with b instead of \beta

y \ = \ b_0 \, + \, b_1 x_1 \, + \, b_2 x_2 \, + \underbrace {(\textcolor {green} {\beta_3 x_3} + u)}_{v}

  • The new error term v now includes the omitted variable: v = (\beta_3 x_3 + u)

  • Therefore, if the omitted variable x_3 is correlated with one or both of the other variables x_1, x_2, as is often the case, assumption MLR.4 is violated, because the new error term v is then correlated via x_3 with the remaining variables x_1, x_2

  • Hence, we would expect: E(\hat b_j) \neq \beta_j, \ \, j=0,1,2

  • This case is called underspecification


Omitted Variable Bias

We assume that x_3 is related to the other x-es by

x_3 = \delta_0 + \delta_1 x_1 + \delta_2 x_2 + e \tag{2.40}

Plugging this into the true model and collecting terms we get:

y \ = \ \underbrace{(\beta_0 + \textcolor {green} {\beta_3 \delta_0})}_{b_0} + \underbrace{(\beta_1 + \textcolor {green}{\beta_3 \delta_1})}_{b_1} x_1 + \underbrace{(\beta_2 + \textcolor {green}{\beta_3 \delta_2})}_{b_2} x_2 + \underbrace{(u + \textcolor {green}{\beta_3 e})}_v \tag{2.41}

  • So, estimating this underspecified model, the estimated coefficients b_j capture both

    • the direct (true partial) effect of x_j on y, represented by \beta_j

    • an effect of x_3 on y, which is falsely attributed to x_j because of the correlation between x_j and x_3. The extent of this bias is \beta_3 \delta_j

  • Hence, the Omitted Variable Bias depends on:

    • \beta_3, the effect of the omitted variable on y. If \beta_3=0, x_3 is actually irrelevant \ \rightarrow no bias

    • \delta_j, the coefficient of x_j in the regression of the omitted variable x_3 on the included regressors, Equation 2.40. If the omitted variable is not (partially) correlated with the remaining explanatory variables, \delta_1 = \delta_2 = 0 \ \rightarrow no bias in the slope coefficients


Under– versus overspecification

  • The omitted variable bias is one of the most important problems in applied econometrics. Often, a lack of data for important variables is the reason, e.g. ability in the wage equation

  • Sometimes, however, it is unclear whether a particular variable should be included in the model

    • If the variable is relevant and it is not included, an omitted variable bias emerges

    • If the variable is irrelevant and included, the variance of the estimates increases

  • If there is a severe multicollinearity problem, it could be better to exclude the variable, even if the variable is relevant. Two effects are then at work

    • The variance of the error term v in Equation 2.41 is larger than the variance of the error term u in the true model, \sigma_v^2 = \beta_3^2 \sigma_e^2 + \sigma^2. As the variance formula in Equation 2.31 shows, this increases the variance of the OLS estimates

    • But on the other hand, omitting a variable reduces the multicollinearity problem – R_j^2 in the variance formula of Equation 2.31 decreases and so does the variance of the OLS estimates

    • If the multicollinearity problem was really severe, the latter effect could dominate and hence omitting a relevant variable could be beneficial, especially in small samples with many regressors

  • But even if the latter effect dominates and the variance decreases, this net effect has to outweigh the bias (the MSE is the relevant criterion here). The following example illustrates the issue


Example – omitted variables

We simulate a model with three explanatory variables and \beta_0 = \beta_1 = \beta_2 = \beta_3 = 1

y \ = \ 1 + x_1 + x_2 + x_3 + u

The x_j are mutually related through three error processes, e_1, e_2, e_3 and an additional noise process \eta_j:

x_1 = 0.4 e_1 + 0.3 e_2 + 0.3 e_3 + \eta_1, \quad x_2 = 0.3 e_1 + 0.4 e_2 + 0.3 e_3 + \eta_2

x_3 = 0.3 e_1 + 0.3 e_2 + 0.4 e_3 + \eta_3

(e_1, e_2, e_3, u) are N(0,1) distributed random vectors of sample size n = 1,000

We investigate what happens if x_3 is omitted. According to Equation 2.41, we expect that \hat b_j = \hat \beta_j + \hat \beta_3 \hat \delta_j, \ j=0, 1, 2


Simulating data, low multicollinearity
Code
library(AER)
library(modelsummary) # for pretty output 

# generating variables x1-x3, y = x1 + x2 + x3 + u 

n  = 1000  # sample size
e1 = rnorm(n); e2 = rnorm(n); e3 = rnorm(n);
x1 = 0.4*e1 + 0.3*e2 + 0.3*e3 + rnorm(n,0,0.5);
x2 = 0.3*e1 + 0.4*e2 + 0.3*e3 + rnorm(n,0,0.5); 
x3 = 0.3*e1 + 0.3*e2 + 0.4*e3 + rnorm(n,0,0.5);
u  = rnorm(n)
y = 1 + x1 + x2 + x3 + u

Regressing the true model
Code
out1 <- lm(y ~ x1 + x2 + x3)
sout1 <- summary(out1) 

modelsummary(list("True Model"=out1), shape =  term ~ statistic ,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'), 
             stars = TRUE, 
             gof_omit = "A|L|B|F",
             align = "ldddddd",
             output = "gt")
True Model
Est. S.E. t p 2.5 % 97.5 %
(Intercept)    0.949***  0.032  29.634  <0.001  0.886  1.012
x1    0.879***  0.054  16.207  <0.001  0.773  0.985
x2    1.084***  0.051  21.099  <0.001  0.983  1.185
x3    1.042***  0.055  18.943  <0.001  0.934  1.149
Num.Obs. 1000      
R2    0.786   
RMSE    1.01    
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Regressing the underspecified model
Code
out2 <- lm(y ~ x1 + x2)
sout2 <- summary(out2)

modelsummary(list("Underspecified Model"=out2), shape =  term ~ statistic ,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'), 
             stars = TRUE, 
             gof_omit = "A|L|B|F",
             align = "ldddddd",
             output = "gt")
Underspecified Model
Est. S.E. t p 2.5 % 97.5 %
(Intercept)    0.925***  0.037  24.795  <0.001  0.852  0.998
x1    1.306***  0.058  22.708  <0.001  1.193  1.419
x2    1.414***  0.056  25.083  <0.001  1.303  1.525
Num.Obs. 1000      
R2    0.709   
RMSE    1.17    
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Auxiliary model, Equation 2.40: Regressing x3 on x1 + x2
  • The R2 is a measure for multicollinearity. Here, there is a moderate degree of multicollinearity
Code
# auxiliary model: regressing x3 as functions of x1 + x2, large noise, low multicollinearity
aux <- lm(x3 ~ x1 + x2)
saux <- summary(aux) 

modelsummary(list("Auxiliary Regression of x3"=aux), shape =  term ~ statistic ,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'), 
             stars = TRUE, 
             gof_omit = "A|L|B|F",
             align = "ldddddd",
             output = "gt")
Auxiliary Regression of x3
Est. S.E. t p 2.5 % 97.5 %
(Intercept)   -0.023     0.018  -1.256   0.209  -0.059  0.013
x1    0.410***  0.028  14.424  <0.001   0.354  0.466
x2    0.316***  0.028  11.362  <0.001   0.262  0.371
Num.Obs. 1000      
R2    0.416   
RMSE    0.58    
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Calculating the coefficients of the underspecified model according to Equation 2.41
Code
# estimated coefficients of true and underspecified model
coef_t <- out1$coefficients;  coef_f <- out2$coefficients

# calculated coefficients according text above
# coefficients true model + coefficient of omitted variable x3 * coefficients of auxiliary model 
coef_cal <- out1$coefficients[1:3] + out1$coefficients["x3"] * aux$coefficients

# standard errors of true and underspecified model
se_t <- sout1$coefficients[1:3,2];  se_f <- sout2$coefficients[1:3,2]; 

# output matrix for table
tab = cbind(coef_t[1:3], coef_f, coef_cal, se_t, se_f)
colnames(tab) = c("True_Model", "Missp_Model", 
                  "Calc_Coeffs", "se_True_Model","se_Missp_Model")
vars <- rownames(tab)

# printing tab
# tab

Summarizing results
  • We have an enormous omitted variable bias

  • Note that the calculated coefficients (according to Equation 2.41) for the underspecified model exactly match the estimated coefficients for the underspecified model

  • Furthermore, the estimated standard errors are somewhat higher for the underspecified model

Code
library(gt)
# `gt()` needs a data frame as argument and "tab" is a matrix. 
#  We are rounding to 3 digits
#  The names of the variables are stored in the vector "vars" 

gt(data.frame(vars ,round(tab, 3)))
Table 2.2:

True and underspecified models and calculated coefficients, low multicollinearity

vars True_Model Missp_Model Calc_Coeffs se_True_Model se_Missp_Model
(Intercept) 0.949 0.925 0.925 0.032 0.037
x1 0.879 1.306 1.306 0.054 0.058
x2 1.084 1.414 1.414 0.051 0.056

Simulating data, high multicollinearity
Code
# generating variables x1-x3, which are correlated and y = x1 + x2 + x3 + u
# variance of noise process eta much lower => higher multicollinearity

n  = 1000  # sample size
e1 = rnorm(n); e2 = rnorm(n); e3 = rnorm(n);
x1 = 0.4*e1 + 0.3*e2 + 0.3*e3 + rnorm(n,0,0.05);
x2 = 0.3*e1 + 0.4*e2 + 0.3*e3 + rnorm(n,0,0.05); 
x3 = 0.3*e1 + 0.3*e2 + 0.4*e3 + rnorm(n,0,0.05);
u  = rnorm(n)
y = 1 + x1 + x2 + x3 + u
Code
# regressing the true model
out11 <- lm(y ~ x1 + x2 + x3)
sout11 <- summary(out11) 
Code
# regressing the underspecified model (omitting x3)
out22 <- lm(y ~ x1 + x2)
sout22 <- summary(out22) 
Auxiliary model, Equation 2.40: Regressing x3 on x1 + x2; much better fit than before
Code
# auxiliary model: regressing x3 as functions of x1 + x2, small noise: sigma of eta 1/10 of previous case 

aux <- lm(x3 ~ x1 + x2)
saux <- summary(aux) 

modelsummary(list("Auxiliary Regression of x3"=aux), shape =  term ~ statistic ,
             statistic = c('std.error', 'statistic', 'p.value', 'conf.int'), 
             stars = TRUE, 
             gof_omit = "A|L|B|F",
             align = "ldddddd",
             output = "gt")
Auxiliary Regression of x3
Est. S.E. t p 2.5 % 97.5 %
(Intercept)   -0.002     0.005  -0.331   0.741  -0.010  0.007
x1    0.476***  0.028  16.793  <0.001   0.420  0.532
x2    0.506***  0.028  17.888  <0.001   0.451  0.562
Num.Obs. 1000      
R2    0.937   
RMSE    0.14    
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Code
# estimated coefficients of true and underspecified model
coef_t <- out11$coefficients;  coef_f <- out22$coefficients

# calculated coefficients according text above
# coefficients true model + coefficient of omitted variable x3 * coefficients of auxiliary model 
coef_cal <- out11$coefficients[1:3] + out11$coefficients["x3"] * aux$coefficients

# standard errors of true and underspecified model
se_t <- sout11$coefficients[1:3,2];  se_f <- sout22$coefficients[1:3,2]; 

# output matrix for table
tab = cbind(coef_t[1:3], coef_f, coef_cal, se_t, se_f)
colnames(tab) = c("True_Model", "Missp_Model", 
                  "Calc_Coeffs", "se_True_Model","se_Missp_Model")

# printing tab
# tab

Summarizing the results of both simulations
Code
lowMultiColl = list( "TrueMod"=out1, "MisspMod"=out2)
highMultiColl = list( "TrueMod"=out11, "MisspMod"=out22 )

m1 <- modelsummary(lowMultiColl,
             stars = TRUE,
             gof_omit = "A|L|B|F",
             align = "ldd",
             output = "flextable")

m2 <- modelsummary(highMultiColl,
             stars = TRUE,
             gof_omit = "A|L|B|F",
             align = "ldd",
             output = "flextable")

m1; m2;

Table 2.3: True and underspecified models for low and high multicollinearity

(a) Low Multicollinearity

            TrueMod     MisspMod
(Intercept)  0.949***    0.925***
            (0.032)     (0.037)
x1           0.879***    1.306***
            (0.054)     (0.058)
x2           1.084***    1.414***
            (0.051)     (0.056)
x3           1.042***
            (0.055)
Num.Obs.  1000        1000
R2           0.786       0.709
RMSE         1.01        1.17
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

(b) High Multicollinearity

            TrueMod     MisspMod
(Intercept)  0.991***    0.989***
            (0.032)     (0.033)
x1           0.257       0.903***
            (0.227)     (0.204)
x2           1.294***    1.982***
            (0.230)     (0.203)
x3           1.357***
            (0.223)
Num.Obs.  1000        1000
R2           0.725       0.714
RMSE         1.01        1.03
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

  • For the high multicollinearity case, we have much less precise estimates even for the true model (and we have 1000 observations!). However, the deviations from the true values are not statistically significant

    • Note that the increase of the standard errors of the true model due to a rise of multicollinearity is totally in accordance with the variance formula of Equation 2.31. You can calculate the expected rise of the standard errors using the R2 of the two auxiliary equations. The expected and actual increase is in the range of 3.2 to 4 times
  • For the high multicollinearity case, we have an enormous omitted variable bias as well, which is statistically significant for the variable x_2

  • As before, the calculated coefficients (according to Equation 2.41) for the underspecified model exactly match the estimated coefficients for the underspecified model (not shown in the table)

  • Furthermore, the estimated standard errors are now lower for the underspecified model (as expected), but the bias dominates this effect by far in our simulation experiment


2.9.3 Errors in variables

  • Measurement errors in the model variables are a very common problem in econometrics

  • Surprisingly, the consequences of this issue depend on whether the dependent or explanatory variables are affected by this problem

Measurement errors in the dependent variable

  • We assume that the true value of y is observed only with a measurement error

    y^O = y + e \tag{2.42}

with y^O being the actually observed value of y and e being the measurement error. Substituting y = y^O - e into Equation 2.1 and rearranging yields

y^O = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + ( \underbrace {u+e}_v ) \tag{2.43}

  • The only consequence is that now we have a new error term v=(u+e), with variance \sigma_v^2 = \sigma^2+\sigma_e^2, if we assume that the measurement error is uncorrelated with the error term u

    • Hence, a measurement error in y leads to less precise estimates of all \beta_j, i.e., larger variances of \hat \beta_j, which is not very surprising

Measurement errors in an explanatory variable

  • We assume that the true value of x_j is only observed with a measurement error

    x_j^O = x_j + e_j \tag{2.44}

    with x_j^O being the actually observed value of x_j and e_j being the associated measurement error

  • Plugging in for x_j, our base model Equation 2.1 becomes

y = \beta_0 + \beta_1 x_1 + \cdots + \beta_j x_j^O + \cdots + \beta_k x_k + ( \underbrace{u - \beta_j e_j}_v ) \tag{2.45}

  • As a consequence, we have a new error term v=(u-\beta_je_j), with variance \sigma_v^2 = \sigma^2 + \beta_j^2 \sigma_{e_j}^2, if we assume that the measurement error is uncorrelated with the error term u. Hence, as before, a measurement error in x_j leads to less precise estimates of all \beta_j

  • However, there is an additional complication: Because of Equation 2.44, the observed model variable x_j^O is correlated with e_j, leading directly to a correlation of the new error term v with x_j^O and consequently to a violation of the conditional mean independence assumption MLR.4. The result is biased estimates \boldsymbol {\hat \beta}


Summarizing:

  • A measurement error in the dependent variable leads to less precise parameter estimates, whereas measurement errors in the explanatory variables additionally lead to biased estimates. This is the errors-in-variables bias, which typically is a downward bias, the so-called attenuation bias (see the simulation sketch below)

  • However, this result strongly depends on the specification in Equation 2.44. The crucial point is whether the measurement error is correlated with the observable variable. With the specification of Equation 2.44 it is, and this is called the classical errors-in-variables model

  • As an example suppose we want to measure ability in a wage equation by the IQ. In this case we have

    IQ = ability + e \tag{2.46}

    so that IQ is simply an inaccurate measure of ability with \operatorname{Corr}(IQ,\, e) > 0, leading to the bias described above
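The attenuation bias of the classical errors-in-variables model is easy to demonstrate with a small simulation of my own (not from the text). With \operatorname{Var}(x) = \operatorname{Var}(e) = 1, the OLS slope based on the mismeasured regressor converges to \beta_1 \cdot \operatorname{Var}(x)/(\operatorname{Var}(x)+\operatorname{Var}(e)) = 0.5 \, \beta_1:

Code
# classical errors-in-variables: regressor observed with measurement error
set.seed(1)
n  <- 10000
x  <- rnorm(n)                 # true regressor, Var(x) = 1
e  <- rnorm(n)                 # measurement error, Var(e) = 1
xO <- x + e                    # observed regressor, Equation 2.44
y  <- 1 + x + rnorm(n)         # true model with beta_1 = 1

coef(lm(y ~ x))[2]    # using the true x: approx. 1
coef(lm(y ~ xO))[2]   # using the mismeasured x: approx. 0.5 (attenuation bias)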

Proxy variables

However, there is another interpretation:

  • Think about ability as a competence which consists of several aspects: besides IQ, there is practical talent, diligence, readiness for effort, work ethic, etc., the latter all summarized in \epsilon

    ability = \delta_0 + \delta_1 IQ + \epsilon \tag{2.47}

  • Here, we implicitly assume that \operatorname{Corr}(IQ,\, \epsilon)=0 \, ! With this approach in mind, we can simply plug Equation 2.47 into a standard wage equation, using IQ as a so-called proxy variable for ability:

    wage = \beta_0 + \beta_1 educ+ \beta_2 exper + \beta_3 \underbrace{(\delta_0 + \delta_1 IQ +\epsilon)}_{ability} + v

    Collecting terms we get:

wage = (\beta_0 + \beta_3 \delta_0) + \beta_1 educ+ \beta_2 exper + \beta_3 \delta_1 IQ + (v + \beta_3\epsilon) \tag{2.48}

E(wage \mid educ, exper, IQ) = (\beta_0 + \beta_3 \delta_0) + \beta_1 educ+ \beta_2 exper + \beta_3 \delta_1 IQ +

E(v + \beta_3\epsilon \mid educ, exper, IQ)

  • To get unbiased (or at least consistent) estimates for \beta_1 and \beta_2, the following conditions must hold

    • E(v \mid educ, exper, IQ) = 0, which is the usual conditional mean independence assumption. However, this condition is now more likely to be met because ability is no longer a part of v

    • E(\epsilon \mid educ, exper, IQ) = 0

  • The latter is the more interesting one; it demands that the other factors determining ability, \epsilon, are likewise uncorrelated with the explanatory variables in the wage equation, especially with educ

  • It is doubtful whether this is true in this particular example. However, even if not, using the proxy IQ for ability might nevertheless be useful if the correlation of educ with \epsilon is smaller than the correlation of educ with the unobservable ability, thereby reducing the omitted variable bias

  • However, in any case, the researcher has to carefully argue that the conditions for the proxy variable approach are actually justifiable on theoretical grounds


Errors in variables – Example

  • We are estimating a wage equation; we are looking for the effect of education on log(wage) = lwage

  • Problem: Ability not observable and probably correlated with both educ and lwage ==> effect of education probably overestimated

  • Possible additional problem, maybe of even more importance: educ is measured in years which might be a bad indicator for “real” education ==> errors in variables problem ==> underestimation of education

library(wooldridge)
library(modelsummary)
data(wage2)

# Model without proxy for ability, but with a lot of control variables
out1 <- lm(lwage ~ educ+exper+tenure+hours+married+black+south+urban, data=wage2)

# Use of IQ as proxy variable
out2 <- lm(lwage ~ educ+exper+tenure+hours+married+black+south+urban+IQ, data=wage2)

# Use of IQ and KWW as proxy variables
out3 <- lm(lwage ~ educ+exper+tenure+hours+married+black+south+urban+IQ+KWW, data=wage2)

# Use of IQ as proxy variable; additional interaction with educ - slope effect 
out4 <- lm(lwage ~ educ+exper+tenure+hours+married+black+south+urban+IQ+I(educ*IQ), data=wage2)


modelsummary( list(out1, out2, out3, out4), 
              output="gt", 
              statistic = "statistic",
              coef_omit = "Inter",
              gof_omit = "A|B|L|F", 
              align = "ldddd", 
              stars = TRUE, 
              fmt = 4)
Table 2.4:

Wage equation with different proxy variables for ability (t-statistic in brackets)

(1) (2) (3) (4)
educ     0.0666***     0.0555***     0.0504***     0.0233   
  (10.6825)      (8.0438)      (6.9825)      (0.5701)  
exper     0.0139***     0.0140***     0.0124***     0.0138***
   (4.3862)      (4.4471)      (3.8671)      (4.3631)  
tenure     0.0113***     0.0109***     0.0104***     0.0110***
   (4.6222)      (4.5032)      (4.2647)      (4.5029)  
hours    -0.0053**     -0.0053**     -0.0056***    -0.0052** 
  (-3.1692)     (-3.1923)     (-3.3713)     (-3.1668)  
married     0.2041***     0.2044***     0.1961***     0.2054***
   (5.2478)      (5.2908)      (5.0661)      (5.3119)  
black    -0.1996***    -0.1543***    -0.1407***    -0.1575***
  (-5.3010)     (-3.9120)     (-3.5343)     (-3.9709)  
south    -0.0911***    -0.0803**     -0.0824**     -0.0804** 
  (-3.4869)     (-3.0754)     (-3.1617)     (-3.0770)  
urban     0.1854***     0.1834***     0.1766***     0.1848***
   (6.9082)      (6.8777)      (6.5983)      (6.9147)  
IQ     0.0036***     0.0031**     -0.0005   
   (3.6086)      (3.0428)     (-0.0915)  
KWW     0.0043*  
   (2.3326)  
I(educ * IQ)     0.0003   
   (0.7993)  
Num.Obs.   935          935          935          935       
R2     0.261         0.271         0.275         0.271    
RMSE     0.36          0.36          0.36          0.36     
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

2.9.4 The random coefficients model

Until now we always assumed that the coefficient vector \boldsymbol\beta = (\beta_1, \ldots , \beta_k) is constant across all observations

  • But this assumption is quite restrictive and sometimes even implausible; for instance in a wage equation that would mean that the payoff of education is equal for all individuals

  • We now want to investigate the consequences when some unobservable factors affect the coefficients \boldsymbol\beta. In particular we assume

\boldsymbol\beta_i = \boldsymbol\beta + \boldsymbol \epsilon_i \tag{2.49}

Here, \boldsymbol\beta_i, the coefficients for observation (unit or individual) i, is equal to \boldsymbol\beta, the population mean which we want to estimate, plus an idiosyncratic error \boldsymbol \epsilon_i, which captures some unobserved factors of unit i affecting the partial effects of the explanatory variables

  • For instance, in our wage equation example that would mean that the partial effect of education on wage depends on some unobservable (individual specific) factors like diligence or motivation

  • Remark: If the partial effects \boldsymbol\beta_i depend on observable factors, we could and should deal with them by interaction terms, see Section 5.2


This formulation leads to the random coefficients model with \mathbf x_i = (x_{1,i}, \ldots, x_{k,i})

y_i \ = \ \beta_0 + \mathbf x_i \boldsymbol\beta_i + u_i \ = \beta_0 + \ \mathbf x_i(\boldsymbol\beta + \boldsymbol\epsilon_i) + u_i \ \Rightarrow

y_i \ = \ \beta_0 + \mathbf x_i\boldsymbol\beta + (\mathbf x_i \boldsymbol \epsilon_i + u_i) \tag{2.50}

  • Hence, the new error term of the model is (\mathbf x_i \boldsymbol \epsilon_i + u_i) and includes \mathbf x_i which could put the unbiasedness and consistency of \hat {\boldsymbol\beta} at risk. Calculating the conditional expectation we get

E[y_i|\mathbf X] \ = \ \beta_0 + \mathbf x_i\boldsymbol\beta + \mathbf x_i E[\boldsymbol\epsilon_i|\mathbf X] + E[u_i|\mathbf X]

  • E[u_i|\mathbf X]=0 by assumption MLR.4 (exogeneity)

  • The second requirement for unbiasedness and consistency is E[\boldsymbol\epsilon_i|\mathbf X]=0

    • So, the crucial assumption is that the unobserved factors affecting \boldsymbol \beta_i are independent from \mathbf X

    • With reference to our example, unobserved factors like diligence or motivation should be independent from the included explanatory variables like education or experience

      • But these unobserved factors are frequently the same as those in u_i (the factors in u_i affect the intercept \beta_0 and those in \epsilon_i the slopes \boldsymbol\beta_i), and so the second requirement is often not an additional constraint
  • Although OLS remains unbiased and consistent under E[\boldsymbol\epsilon_i|\mathbf X]=0, the error term (\mathbf x_i \boldsymbol \epsilon_i + u_i) of the random coefficients model almost certainly exhibits heteroskedasticity with the conditional variance containing the term \mathbf x_i \! \operatorname {Var}(\boldsymbol \epsilon_i) \, \mathbf x_i'

    • Some authors even view random coefficients (slopes) as a main reason for heteroskedastic errors; the sketch below illustrates this point
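A minimal sketch of my own (not from the text): with E[\boldsymbol\epsilon_i \mid \mathbf X]=0 the OLS slope still centers on the population mean \beta, but the residuals become heteroskedastic, which a Breusch-Pagan test typically detects:

Code
# random coefficient model: beta_i = beta + eps_i, eps_i independent of x
library(lmtest)                      # for the Breusch-Pagan test bptest()
set.seed(1)
n   <- 1000
x   <- rnorm(n, mean = 2, sd = 1)
eps <- rnorm(n, sd = 0.5)            # unit-specific deviation of the slope
u   <- rnorm(n)
y   <- 1 + (1 + eps)*x + u           # beta_i = 1 + eps_i, population mean slope = 1

fit <- lm(y ~ x)
coef(fit)[2]       # close to the population mean slope of 1
bptest(fit)        # heteroskedasticity is usually detected (small p-value)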

2.9.5 Functional forms

With OLS, some non-linearities in the assumed relationship could be modeled by including non-linear transformations of the variables, such as logs, squares or higher order polynomial terms

  • Thereby, two difficulties arise

    1. The partial effects are not constant, as in a linear model, but depend on the values of the variables. So, in order to get an easily interpretable result, you have to calculate the APEs. In R this can be done with the function avg_slopes() from the marginaleffects package. With plot_predictions() you can plot the non-linear relationships at average values of the other variables (a minimal sketch follows after this list)

    2. It is often not clear whether a linear specification is sufficient or whether there are important non-linearities at work. Furthermore, there are infinitely many possibilities of non-linearities. So, we need a specification test
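As a minimal illustration of point 1 (assuming the marginaleffects package is installed; the exact output depends on the package version), one could compute average partial effects for a wage model with a quadratic experience term:

Code
# APEs for a non-linear specification, using avg_slopes() from marginaleffects
library(wooldridge); data("wage1")
library(marginaleffects)

mod <- lm(lwage ~ educ + exper + I(exper^2), data = wage1)

avg_slopes(mod)                              # average partial effects of educ and exper
plot_predictions(mod, condition = "exper")   # predicted lwage over exper, other regressors at typical values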


RESET

A general test for functional form misspecification is the so-called Regression Equation Specification Error Test, or simply RESET. The idea of RESET is to include squares and possibly higher-order powers of the fitted values of y in the regression

y = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_j + \textcolor {red}{\delta_1 \hat y^2 + \delta_2 \hat y^3} + u \tag{2.51}

  • As \hat{y} is a linear combination of all explanatory variables, terms like \hat y ^i are non-linear transformations of these. So, if the model is correctly specified, terms like \hat y ^i should have no impact on y

  • With RESET one simply checks whether the terms \hat y ^i are significant, i.e., we test with an F-test whether \delta_1, \delta_2, \ldots are all 0

  • This can be done

    • by hand (estimating the model, storing the predicted values \hat y, calculating \hat y ^2, \hat y ^3 and so on, estimating a second model including these terms and carrying out an F-test); see the sketch at the end of this subsection

    • or more conveniently in R with the function resettest()

  • In the following example we take our wage model and check whether a specification without a non-linear term is sufficient


library(AER)
library(wooldridge); data("wage1")

# estimating the linear wage model and storing the results in list myres0
myres0 <- lm(lwage ~ educ + exper, data = wage1)

# carrying out a RESET, the default is inclusion of 2nd and 3rd order 
# terms of predicted values of y
resettest(myres0)
      
        RESET test
      
      data:  myres0
      RESET = 6.6502, df1 = 2, df2 = 521, p-value = 0.001406
# So, the H0 is clearly rejected, meaning that the model suffers from 
# a general misspecification

# Estimating the model with non-linear exper and storing the results in list myres1
myres1 <- lm(lwage ~ educ + exper + expersq, data = wage1) 


# RESET for the full model 
resettest(myres1)
      
        RESET test
      
      data:  myres1
      RESET = 6.1111, df1 = 2, df2 = 520, p-value = 0.002381
summary(myres1)
      
      Call:
      lm(formula = lwage ~ educ + exper + expersq, data = wage1)
      
      Residuals:
           Min       1Q   Median       3Q      Max 
      -1.96387 -0.29375 -0.04009  0.29497  1.30216 
      
      Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
      (Intercept)  0.1279975  0.1059323   1.208    0.227    
      educ         0.0903658  0.0074680  12.100  < 2e-16 ***
      exper        0.0410089  0.0051965   7.892 1.77e-14 ***
      expersq     -0.0007136  0.0001158  -6.164 1.42e-09 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      Residual standard error: 0.4459 on 522 degrees of freedom
      Multiple R-squared:  0.3003,  Adjusted R-squared:  0.2963 
      F-statistic: 74.67 on 3 and 522 DF,  p-value: < 2.2e-16
  • Even with expersq, RESET points to a functional misspecification of this regression model. Maybe, additional nonlinear terms or interaction terms are necessary for a correct specification, see Section 5.2
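The “by hand” version of RESET mentioned above can be sketched as follows; the F statistic from anova() should reproduce the resettest() result for the first model above (up to rounding):

Code
# RESET by hand: augment the model with powers of the fitted values and F-test them
library(wooldridge); data("wage1")

m0 <- lm(lwage ~ educ + exper, data = wage1)
wage1$yhat <- fitted(m0)                   # store the predicted values

m1 <- lm(lwage ~ educ + exper + I(yhat^2) + I(yhat^3), data = wage1)

anova(m0, m1)   # F-test of delta_1 = delta_2 = 0, compare with resettest(myres0) above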

  1. To simplify, we suppress any intercepts and assume that the variables all have an expectation of zero.↩︎

  2. If you want, you can norm the coefficients of Q to one by dividing the first equation by 1+t_1 and the second one by 1+t_2. Clearly, this doesn’t alter the solution set of the equations.↩︎

  3. Otherwise, we would have to employ an IV-estimator. For a general discussion of these issues and basic identification problems, see Chapter 7.↩︎

  4. Without intercept, because the intercept is part of the other explanatory variables; actually, you can include the intercept, but \hat \beta_0 would be zero anyway.↩︎